How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry

Tobias Kind, Martin Scholz, Oliver Fiehn

Research output: Contribution to journal (Article)

82 Citations (Scopus)

Abstract

Background: Calculating the metabolome size of species by genome-guided reconstruction of metabolic pathways misses all products from orphan genes and from enzymes lacking annotated genes. Hence, metabolomes need to be determined experimentally. Annotations by mass spectrometry would greatly benefit if peer-reviewed public databases could be queried to compile target lists of structures that have already been reported for a given species. We detail current obstacles to compiling such a knowledge base of metabolites.

Results: As an example, results are presented for rice. Two rice (Oryza sativa) subspecies have been fully sequenced, japonica and indica. Several major small-molecule databases were compared for their listings of known rice metabolites: PubChem, Chemical Abstracts, Beilstein, patent databases, Dictionary of Natural Products, SetupX/BinBase, KNApSAcK DB, and finally databases that were obtained by computational approaches, i.e., RiceCyc, KEGG, and Reactome. More than 5,000 small molecules were retrieved when searching these databases. Unfortunately, genuine rice metabolites were most often retrieved together with non-metabolite database entries such as pesticides. Overlaps between database compound lists were very difficult to determine because structures were either not encoded in machine-readable format or because compound identifiers were not cross-referenced between databases.

Conclusions: We conclude that present databases are not capable of comprehensively retrieving all known metabolites. Metabolome lists are still mostly restricted to genome-reconstructed pathways. We suggest that providers of (bio)chemical databases enrich their database identifiers with PubChem IDs and InChIKeys to enable cross-database queries. In addition, peer-reviewed journal repositories need to mandate submission of structures and spectra in machine-readable format to allow automated semantic annotation of articles containing chemical structures. Such changes in publication standards and database architectures will enable researchers to compile current knowledge about the metabolome of species, which may extend to derived information such as spectral libraries, organ-specific metabolites, and cross-study comparisons.
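The cross-database comparison that the abstract calls for can be illustrated with a minimal sketch. It assumes each database's compound list has already been exported as a set of InChIKeys, which is exactly the step the abstract reports as missing in practice. The database names (`db_a`, `db_b`, `db_c`) and the keys are hypothetical placeholders that only mimic the 14-10-1 character layout of an InChIKey; they are not real compounds or real database contents, and the helper functions are illustrative, not part of the published work.

```python
from itertools import combinations

# Toy compound lists keyed by placeholder InChIKeys.
# Assumption: these are NOT real compounds or real database contents.
compound_lists = {
    "db_a": {
        "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
        "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
        "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
    },
    "db_b": {
        "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
        "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
        "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
    },
    "db_c": {
        "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
    },
}


def pairwise_overlap(lists):
    """Map each pair of database names to the InChIKeys they share."""
    return {
        (a, b): lists[a] & lists[b]
        for a, b in combinations(sorted(lists), 2)
    }


def shared_by_all(lists):
    """InChIKeys present in every list (expects at least one list)."""
    return set.intersection(*lists.values())


def combined(lists):
    """Non-redundant union of all lists."""
    return set().union(*lists.values())


if __name__ == "__main__":
    for pair, keys in pairwise_overlap(compound_lists).items():
        print(pair, len(keys))
    print("in all lists:", sorted(shared_by_all(compound_lists)))
    print("unique compounds:", len(combined(compound_lists)))
```

Once every entry carries a shared structure-derived key, the overlap reduces to plain set operations; without such a key, the same comparison requires name matching or manual curation.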

Original language: English (US)
Article number: e0005440
Journal: PLoS One
Volume: 4
Issue number: 5
DOI: 10.1371/journal.pone.0005440
State: Published - 2009

Fingerprint

Metabolome
Electronic data interchange
data analysis
chemistry
Databases
Metabolites
Genes
rice
Oryza
peers
Chemical Databases
Genome Size
Knowledge Bases
Molecules
genome
Metabolic Networks and Pathways
Biological Products
patents

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Medicine (all)

Cite this

How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry. / Kind, Tobias; Scholz, Martin; Fiehn, Oliver.

In: PLoS One, Vol. 4, No. 5, e0005440, 2009.

@article{518c9e69220441d6a995bb29712af095,
title = "How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry",
author = "Tobias Kind and Martin Scholz and Oliver Fiehn",
year = "2009",
doi = "10.1371/journal.pone.0005440",
language = "English (US)",
volume = "4",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "5",
pages = "e0005440",
}

TY - JOUR

T1 - How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry

AU - Kind, Tobias

AU - Scholz, Martin

AU - Fiehn, Oliver

PY - 2009

Y1 - 2009

UR - http://www.scopus.com/inward/record.url?scp=67649785231&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67649785231&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0005440

DO - 10.1371/journal.pone.0005440

M3 - Article

C2 - 19415114

AN - SCOPUS:67649785231

VL - 4

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 5

M1 - e0005440

ER -