Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

Thomas J. Sharpton, Guillaume Jospin, Dongying Wu, Morgan G I Langille, Katherine S. Pollard, Jonathan A Eisen

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

Background: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences.

Results: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as "Sifting Families," or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology-based analyses.

Conclusions: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
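The sift-then-cluster idea in the Results can be illustrated with a toy Python sketch. It is not the authors' pipeline: the abstract does not name the similarity measure or clustering algorithm, so the sketch assumes a character-matching ratio from `difflib` as a stand-in for alignment similarity, single-linkage clustering for the leftover sequences, and a hypothetical threshold of 0.8. The function names are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for an alignment score (assumption): matching ratio in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def sift_and_cluster(families, new_seqs, threshold=0.8):
    """Recruit new sequences into known families, then cluster only the leftovers.

    families:  list of lists of sequences (the known families)
    new_seqs:  sequences from newly added genomes
    threshold: hypothetical similarity cutoff for family membership
    Returns the updated known families followed by any novel families.
    """
    families = [list(f) for f in families]  # copy so the input is not mutated
    leftovers = []

    # Step 1 (sift): a sequence similar enough to a known family joins the
    # best-scoring one and is removed from the search for new families.
    for seq in new_seqs:
        best, best_score = None, threshold
        for i, fam in enumerate(families):
            score = max(similarity(seq, member) for member in fam)
            if score >= best_score:
                best, best_score = i, score
        if best is None:
            leftovers.append(seq)
        else:
            families[best].append(seq)

    # Step 2 (cluster): single-linkage clustering over the leftovers only,
    # which is where the reduction in pairwise comparisons comes from.
    novel = []
    for seq in leftovers:
        linked, unlinked = [], []
        for cluster in novel:
            hit = any(similarity(seq, m) >= threshold for m in cluster)
            (linked if hit else unlinked).append(cluster)
        merged = [seq] + [m for cluster in linked for m in cluster]
        novel = unlinked + [merged]

    return families + novel


if __name__ == "__main__":
    known = [["MKTAYIAKQR"]]
    incoming = ["MKTAYIAKQK", "GGGSSGGGSS", "GGGSSGGGSA", "WWWWPPPPWW"]
    for family in sift_and_cluster(known, incoming):
        print(family)
```

Running the same function again on the next batch of genomes, with the returned families as the new `families` argument, mimics the iterative update: only sequences that match no existing family reach the clustering step.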

Original language: English (US)
Article number: 264
Journal: BMC Bioinformatics
Volume: 13
Issue number: 1
DOI: 10.1186/1471-2105-13-264
State: Published - Oct 13 2012

Fingerprint

Cluster Analysis
Genome
Genes
Clustering
Proteins
Protein
Resources
Protein Databases
Protein Sequence
Metagenome
Sequencing
Sequence Alignment
Update
Sequence Homology
Family
Multiple Sequence Alignment
Hidden Markov models
Phylogenetic Tree
Databases
Network Topology

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
  • Structural Biology

Cite this

Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource. / Sharpton, Thomas J.; Jospin, Guillaume; Wu, Dongying; Langille, Morgan G I; Pollard, Katherine S.; Eisen, Jonathan A.

In: BMC Bioinformatics, Vol. 13, No. 1, 264, 13.10.2012.

Sharpton, Thomas J. ; Jospin, Guillaume ; Wu, Dongying ; Langille, Morgan G I ; Pollard, Katherine S. ; Eisen, Jonathan A. / Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource. In: BMC Bioinformatics. 2012 ; Vol. 13, No. 1.
@article{aabdb9b1ffc84b68ba6bd9b942c05d8d,
title = "Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource",
abstract = "Background: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. Results: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as {"}Sifting Families,{"} or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology-based analyses. Conclusions: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).",
author = "Sharpton, {Thomas J.} and Guillaume Jospin and Dongying Wu and Langille, {Morgan G I} and Pollard, {Katherine S.} and Eisen, {Jonathan A}",
year = "2012",
month = "10",
day = "13",
doi = "10.1186/1471-2105-13-264",
language = "English (US)",
volume = "13",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

AU - Sharpton, Thomas J.

AU - Jospin, Guillaume

AU - Wu, Dongying

AU - Langille, Morgan G I

AU - Pollard, Katherine S.

AU - Eisen, Jonathan A

PY - 2012/10/13

Y1 - 2012/10/13

AB - Background: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. Results: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as "Sifting Families," or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology-based analyses. Conclusions: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).

UR - http://www.scopus.com/inward/record.url?scp=84867294452&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867294452&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-13-264

DO - 10.1186/1471-2105-13-264

M3 - Article

VL - 13

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 264

ER -