The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Shibu Yooseph, Granger Sutton, Douglas B. Rusch, Aaron L. Halpern, Shannon J. Williamson, Karin Remington, Jonathan A Eisen, Karla B. Heidelberg, Gerard Manning, Weizhong Li, Lukasz Jaroszewski, Piotr Cieplak, Christopher S. Miller, Huiying Li, Susan T. Mashiyama, Marcin P. Joachimiak, Christopher Van Belle, John Marc Chandonia, David A. Soergel, Yufeng Zhai, Kannan Natarajan, Shaun Lee, Benjamin J. Raphael, Vineet Bafna, Robert Friedman, Steven E. Brenner, Adam Godzik, David Eisenberg, Jack E. Dixon, Susan S. Taylor, Robert L. Strausberg, Marvin Frazier, J. Craig Venter

Research output: Contribution to journal › Article

545 Citations (Scopus)

Abstract

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
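The abstract's two central ideas, grouping proteins by sequence similarity and tracking how many new families appear as sequences are added, can be illustrated with a toy sketch. This is not the authors' pipeline: the function names, the 0.6 threshold, the six short sequences, and the use of Python's `difflib.SequenceMatcher` as a stand-in for a real alignment score are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(seqs, threshold=0.6):
    """Greedy incremental clustering (hypothetical helper).

    Each sequence joins the first cluster whose representative is at least
    `threshold` similar; otherwise it founds a new cluster. Returns the
    clusters and the running cluster count after each added sequence.
    """
    reps, clusters, curve = [], [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if similarity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            # No existing representative was similar enough: new "family".
            reps.append(s)
            clusters.append([s])
        curve.append(len(clusters))
    return clusters, curve


# Made-up peptide strings, not real GOS data.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVA",
       "MKTAYLAKQR", "GSHMLEDPVG", "WWPLNCCTRE"]
clusters, curve = greedy_cluster(toy)
print(curve)  # running number of clusters as sequences are added
```

Plotting `curve` against the number of sequences gives a discovery curve. A curve that keeps rising roughly linearly, rather than flattening, is the pattern the abstract reports for protein families.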

Original language: English (US)
Journal: PLoS Biology
Volume: 5
Issue number: 3
DOIs: https://doi.org/10.1371/journal.pbio.0050016
State: Published - Mar 2007

Fingerprint

Expeditions
Oceans and Seas
oceans
Sampling
Proteins
proteins
sampling
DNA Repair Enzymes
Metagenomics
Protein Databases
Glutamate-Ammonia Ligase
Firearms
Genomics
Phosphoric Monoester Hydrolases
DNA Damage
Cluster Analysis
new family
glutamate-ammonia ligase
Peptide Hydrolases
DNA damage

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)

Cite this

Yooseph, S., Sutton, G., Rusch, D. B., Halpern, A. L., Williamson, S. J., Remington, K., ... Venter, J. C. (2007). The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biology, 5(3). https://doi.org/10.1371/journal.pbio.0050016

@article{e6bffd7e320841f7867c2ee684c7b7a9,
title = "The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.",
author = "Shibu Yooseph and Granger Sutton and Rusch, {Douglas B.} and Halpern, {Aaron L.} and Williamson, {Shannon J.} and Karin Remington and Eisen, {Jonathan A} and Heidelberg, {Karla B.} and Gerard Manning and Weizhong Li and Lukasz Jaroszewski and Piotr Cieplak and Miller, {Christopher S.} and Huiying Li and Mashiyama, {Susan T.} and Joachimiak, {Marcin P.} and {Van Belle}, Christopher and Chandonia, {John Marc} and Soergel, {David A.} and Yufeng Zhai and Kannan Natarajan and Shaun Lee and Raphael, {Benjamin J.} and Vineet Bafna and Robert Friedman and Brenner, {Steven E.} and Adam Godzik and David Eisenberg and Dixon, {Jack E.} and Taylor, {Susan S.} and Strausberg, {Robert L.} and Marvin Frazier and Venter, {J. Craig}",
year = "2007",
month = "3",
doi = "10.1371/journal.pbio.0050016",
language = "English (US)",
volume = "5",
journal = "PLoS Biology",
issn = "1544-9173",
publisher = "Public Library of Science",
number = "3",

}

TY - JOUR

T1 - The Sorcerer II Global Ocean Sampling expedition

T2 - expanding the universe of protein families.

AU - Yooseph, Shibu

AU - Sutton, Granger

AU - Rusch, Douglas B.

AU - Halpern, Aaron L.

AU - Williamson, Shannon J.

AU - Remington, Karin

AU - Eisen, Jonathan A

AU - Heidelberg, Karla B.

AU - Manning, Gerard

AU - Li, Weizhong

AU - Jaroszewski, Lukasz

AU - Cieplak, Piotr

AU - Miller, Christopher S.

AU - Li, Huiying

AU - Mashiyama, Susan T.

AU - Joachimiak, Marcin P.

AU - Van Belle, Christopher

AU - Chandonia, John Marc

AU - Soergel, David A.

AU - Zhai, Yufeng

AU - Natarajan, Kannan

AU - Lee, Shaun

AU - Raphael, Benjamin J.

AU - Bafna, Vineet

AU - Friedman, Robert

AU - Brenner, Steven E.

AU - Godzik, Adam

AU - Eisenberg, David

AU - Dixon, Jack E.

AU - Taylor, Susan S.

AU - Strausberg, Robert L.

AU - Frazier, Marvin

AU - Venter, J. Craig

PY - 2007/3

Y1 - 2007/3

UR - http://www.scopus.com/inward/record.url?scp=34247282037&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34247282037&partnerID=8YFLogxK

U2 - 10.1371/journal.pbio.0050016

DO - 10.1371/journal.pbio.0050016

M3 - Article

C2 - 17355171

AN - SCOPUS:33947235074

VL - 5

JO - PLoS Biology

JF - PLoS Biology

SN - 1544-9173

IS - 3

ER -