Notebook link where you can find all the graphs.

Data created on 17.10.2022 at 19:32:45

Data updated on 17.10.2022 at 19:32:45

Support by Ontologies

NCBI Taxonomy is an excellent source of standardized organism names.

Data Sanitisation and Missing Values

Organisms are found only in metabolomics-related repositories, i.e., MTBLS and MW.

| Repository | Field Type | Field Name | Values Readability | Unit | Missing |
| --- | --- | --- | --- | --- | --- |
| MTBLS | dedicated | Organism | ontology-driven | none | The field is not provided; or the value is provided as N/A or a similar expression; or the study "assays" value is "null"; or the organism is not found in NCBI Taxonomy. |
| MW | dedicated | Subject Species | free text | none | The field is not provided; or the value is provided as N/A or a similar expression; or decoding the JSON file containing the study details failed because of a syntax error in the file; or the organism is not found in NCBI Taxonomy. |

| Repository | Input Examples | Output |
| --- | --- | --- |
| MTBLS | ["Homo sapiens", "Blank sample", "Lactobacillus sp. asf360;Parabacteroides sp. asf519", "Sus scrofa domesticus", "NCBITAXON:Thalassiosira pseudonana;NCBITAXON:Ruegeria pomeroyi"] | ["homo-sapiens", "mus-musculus", etc.] |
| MW | ["Homo sapiens", "Sus scrofa", "Sus Scrofa", "C57BL/6J Mouse", "Multi-species non-defined biofilm consortium", "Alexandrium catenella; Alexandrium tamarense"] | ["homo-sapiens", "mus-musculus", etc.] |
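
The "Missing" conditions listed for the two repositories can be illustrated with a minimal Python sketch. It is not the code used to produce these results. The marker set `MISSING_MARKERS`, the helper names, and the assumption that the MW study details are a flat JSON object keyed by "Subject Species" are all hypothetical, since the page only says "N/A or other similar expressions" and does not describe the file layout. The NCBI Taxonomy lookup is left out.

```python
import json
from typing import Optional

# Hypothetical placeholder strings counted as "missing"; the source only
# says "N/A or other similar expressions", so this set is an assumption.
MISSING_MARKERS = {"", "n/a", "na", "none", "null", "not applicable", "unknown"}


def is_missing(value: Optional[str]) -> bool:
    """Return True when an organism value should be counted as missing."""
    if value is None:
        return True
    return value.strip().lower() in MISSING_MARKERS


def read_mw_organism(raw_json: str, key: str = "Subject Species") -> Optional[str]:
    """Read the organism from MW study details, or None if it is missing.

    Assumes (hypothetically) that the details are a flat JSON object.
    """
    try:
        details = json.loads(raw_json)
    except json.JSONDecodeError:
        # A syntax error in the JSON file counts as a missing value.
        return None
    value = details.get(key) if isinstance(details, dict) else None
    if not isinstance(value, str) or is_missing(value):
        return None
    return value
```

A value that passes these checks would still count as missing if it cannot be matched to an NCBI Taxonomy entry.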


Organism details are available in the metabolomics repositories. Submitters usually give the scientific name, which can be obtained from NCBI Taxonomy. However, several inconsistencies were encountered: incorrect letter case in the scientific name (e.g., Mus Musculus instead of Mus musculus), a common name in place of the scientific one (e.g., Goat instead of Capra hircus), the source included along with the value (e.g., NCBITAXON:Homo sapiens), and plain typos.
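
As an illustration of how such inconsistencies might be cleaned up, here is a short Python sketch that strips a source prefix, maps a few common names to scientific names, and produces hyphenated lower-case identifiers like those in the output examples. The function name `normalise_organism` and the `COMMON_NAMES` lookup are hypothetical; a real pipeline would resolve names against NCBI Taxonomy, and this sketch makes no attempt to correct typos.

```python
import re

# Hypothetical lookup from common names to scientific names. A real pipeline
# would resolve these against NCBI Taxonomy instead of a hard-coded dict.
COMMON_NAMES = {
    "goat": "Capra hircus",
    "mouse": "Mus musculus",
    "human": "Homo sapiens",
}


def normalise_organism(raw: str) -> str:
    """Turn a raw organism string into a lower-case, hyphenated identifier."""
    name = raw.strip()
    # Drop a leading source prefix such as "NCBITAXON:".
    name = re.sub(r"^NCBITAXON:\s*", "", name, flags=re.IGNORECASE)
    # Replace a known common name with its scientific name.
    name = COMMON_NAMES.get(name.lower(), name)
    # Lower-case, then collapse every run of non-alphanumerics into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
```

With this sketch, "Mus Musculus" and "Mus musculus" both map to "mus-musculus", so case variants no longer count as different organisms.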

Some values were ambiguous, such as "Various", "Extract", or "Multi-species non-defined biofilm consortium", and some were not organisms at all, such as "NMR buffer". In addition, some studies list more than one species. The names are usually separated by ";" or "/", but the relation between the species remains unclear (samples from multiple species vs. one sample from tissue of one species infected by another). Lastly, the organism provided varies in rank: mostly it is the species, but sometimes it is the genus or the strain.
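
Splitting multi-species values could look like the following Python sketch. The helper name `split_organisms` is hypothetical, and the choice to split only on ";" is an assumption made here: "/" also occurs inside strain names such as "C57BL/6J Mouse", so splitting on it blindly would break valid values.

```python
from typing import List


def split_organisms(raw: str) -> List[str]:
    """Split a multi-species value into its individual organism names.

    Only ";" is treated as a separator, because "/" also appears inside
    strain names (e.g. "C57BL/6J Mouse").
    """
    parts = [part.strip() for part in raw.split(";")]
    # Discard empty fragments left by leading, trailing, or doubled separators.
    return [part for part in parts if part]
```

For example, "Alexandrium catenella; Alexandrium tamarense" yields two names, while "C57BL/6J Mouse" stays intact. The split says nothing about how the species relate to each other, which still has to be read from the study description.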

Even with all of these caveats taken into account, it is still clear that the most studied species are humans (Homo sapiens) and mice (Mus musculus).

A rough estimate of the percentages of all studies in MTBLS and MW repositories based on the organism

Here one can see the number of studies providing the organism and its value.

The number of studies in MTBLS and MW based on the organism