Playing with iNaturalist taxonomy data in MariaDB

You can download the iNaturalist taxonomy data (updated monthly) from https://www.inaturalist.org/pages/developers under “iNaturalist Taxonomy DarwinCore Archive” (https://www.inaturalist.org/taxa/inaturalist-taxonomy.dwca.zip). After unzipping, you’ll find “taxa.csv”. Here’s one way of importing this into MariaDB (and I guess MySQL too). I know that the schema for my table isn’t optimal. (I would love to hear suggestions on how to improve this.)

CREATE DATABASE inaturalist;
USE inaturalist;

CREATE TABLE taxonomy
  (id INT NOT NULL,
  taxonID VARCHAR(255),
  identifier VARCHAR(255),
  parentNameUsageID VARCHAR(255),
  kingdom VARCHAR(255),
  phylum VARCHAR(255),
  class VARCHAR(255),
  `order` VARCHAR(255),
  family VARCHAR(255),
  genus VARCHAR(255),
  specificEpithet VARCHAR(255),
  infraspecificEpithet VARCHAR(255),
  modified VARCHAR(255),
  scientificName VARCHAR(255),
  taxonRank VARCHAR(255),
  `references` TEXT);

LOAD DATA INFILE "/absolute/path/to/taxa.csv"
INTO TABLE taxonomy
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

Now that we have it imported, let’s have a little fun. How many species per kingdom?

select kingdom, count(*) as num_species from taxonomy where taxonRank = 'species' group by kingdom order by num_species desc;
+-----------+-------------+
| kingdom   | num_species |
+-----------+-------------+
| Animalia  |      710265 |
| Plantae   |      263359 |
| Fungi     |       45540 |
| Chromista |        6699 |
| Bacteria  |        1829 |
| Protozoa  |        1485 |
| Viruses   |         424 |
| Archaea   |          82 |
|           |           1 |
+-----------+-------------+
9 rows in set (0.698 sec)
  • “Chromista” is new to me. Wikipedia says: “Chromista is a proposed but polyphyletic biological kingdom, refined from the Chromalveolata, consisting of single-celled and multicellular eukaryotic species that share similar features in their photosynthetic organelles (plastids). It includes all eukaryotes whose plastids contain chlorophyll c and are surrounded by four membranes.”

That one species with an empty kingdom seems to be a data error:

select id, scientificName from taxonomy where kingdom = '' and taxonRank = 'species';
+---------+--------------------+
| id | scientificName |
+---------+--------------------+
| 1558845 | Caelicola rostrata |
+---------+--------------------+
1 row in set (0.444 sec)
  • It’s a moth, per Wikipedia: “Caelicola is a monotypic moth genus in the subfamily Lymantriinae. Its only species, Caelicola rostrata, is found in the Brazilian state of Amazonas.”
  • I’m working on reporting this now.

The data include 885 families of plants:

select count(distinct family) from taxonomy where kingdom = 'Plantae' and taxonRank = 'species';
+------------------------+
| count(distinct family) |
+------------------------+
|                    885 |
+------------------------+
1 row in set (0.468 sec)

The top ten of those by species count:

select family, count(*) num_species from taxonomy where kingdom = 'Plantae' and taxonRank = 'species' group by family order by num_species desc limit 10;
+---------------+-------------+
| family        | num_species |
+---------------+-------------+
| Asteraceae    |       24664 |
| Orchidaceae   |       17393 |
| Fabaceae      |       16494 |
| Poaceae       |        7805 |
| Rubiaceae     |        7598 |
| Lamiaceae     |        5992 |
| Euphorbiaceae |        4828 |
| Myrtaceae     |        4639 |
| Ericaceae     |        4435 |
| Cyperaceae    |        4315 |
+---------------+-------------+
10 rows in set (0.553 sec)

I’m looking forward to doing some visualizations based on these data and to seeing what we can do combining it with other sources, but I’ll leave that for another time.


Comments

One response to “Playing with iNaturalist taxonomy data in MariaDB”

  1. Bug report for the species with an empty kingdom (really empty everything): https://forum.inaturalist.org/t/caelicola-rostrata-is-missing-kingdom-phylum-class-order-family-in-taxa-export/53936 Apparently, it’s expected to be fixed in the next data dump since it just reflected a temporary issue (partial update?) in the underlying DB.

Leave a Reply

Your email address will not be published. Required fields are marked *