Stalking the fourth domain in metagenomic data: Searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees

Dongying Wu, Martin Wu, Aaron Halpern, Douglas B. Rusch, Shibu Yooseph, Marvin Frazier, J. Craig Venter, Jonathan A. Eisen

Research output: Contribution to journal › Article

63 Citations (Scopus)

Abstract

Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species.

Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.

Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.

Original language: English (US)
Article number: e18011
Journal: PLoS One
Volume: 6
Issue number: 3
DOI: 10.1371/journal.pone.0018011
State: Published - 2011

Fingerprint

Stalking
Metagenomics
Genes
genetic markers
phylogeny
organisms
Ancient History
Expeditions
Sequence Alignment
Information Storage and Retrieval
ribosomal RNA
Oceans and Seas
Sequence Analysis
DNA sequences
Viruses
Genome
genes
sequence alignment
Sampling
methodology

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Medicine (all)

Cite this

Stalking the fourth domain in metagenomic data: Searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees. / Wu, Dongying; Wu, Martin; Halpern, Aaron; Rusch, Douglas B.; Yooseph, Shibu; Frazier, Marvin; Venter, J. Craig; Eisen, Jonathan A.

In: PLoS One, Vol. 6, No. 3, e18011, 2011.

Wu, Dongying; Wu, Martin; Halpern, Aaron; Rusch, Douglas B.; Yooseph, Shibu; Frazier, Marvin; Venter, J. Craig; Eisen, Jonathan A. / Stalking the fourth domain in metagenomic data: Searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees. In: PLoS One. 2011; Vol. 6, No. 3.
@article{41649096dbdf48cdbfd8ecd772a3c1a3,
title = "Stalking the fourth domain in metagenomic data: Searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees",
abstract = "Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species. Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences. Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.",
author = "Dongying Wu and Martin Wu and Aaron Halpern and Rusch, {Douglas B.} and Shibu Yooseph and Marvin Frazier and Venter, {J. Craig} and Eisen, {Jonathan A}",
year = "2011",
doi = "10.1371/journal.pone.0018011",
language = "English (US)",
volume = "6",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "3",

}

TY - JOUR

T1 - Stalking the fourth domain in metagenomic data

T2 - Searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees

AU - Wu, Dongying

AU - Wu, Martin

AU - Halpern, Aaron

AU - Rusch, Douglas B.

AU - Yooseph, Shibu

AU - Frazier, Marvin

AU - Venter, J. Craig

AU - Eisen, Jonathan A

PY - 2011

Y1 - 2011

N2 - Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species. Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences. Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.

UR - http://www.scopus.com/inward/record.url?scp=79952806659&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952806659&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0018011

DO - 10.1371/journal.pone.0018011

M3 - Article

C2 - 21437252

AN - SCOPUS:79952806659

VL - 6

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 3

M1 - e18011

ER -