Nh3D: A reference dataset of non-homologous protein structures

B. Thiruv, Gerald Quon, S. A. Saldanha, B. Steipe

Research output: Contribution to journal › Article

18 Citations (Scopus)

Abstract

Background: The statistical analysis of protein structures requires datasets in which structural features can be considered independently distributed, i.e. not related through common ancestry, and that fulfil minimal requirements regarding the experimental quality of the structures it contains. However, non-redundant datasets based on sequence similarity invariably contain distantly related homologues. Here we provide a reference dataset of non-homologous protein domains, assuming that structural dissimilarity at the topology level is incompatible with recognizable common ancestry. The dataset is based on domains at the Topology level of the CATH database which hierarchically classifies all protein structures. It contains the best refined representatives of each Topology level, validates structural dissimilarity and removes internally duplicated fragments. The compilation of Nh3D is fully scripted.

Results: The current Nh3D list contains 570 domains with a total of 90780 residues. It covers more than 70% of folds at the Topology level of the CATH database and represents more than 90% of the structures in the PDB that have been classified by CATH. We observe that even though all protein pairs are structurally dissimilar, some pairwise sequence identities after global alignment are greater than 30%.

Conclusion: Nh3D is freely available as a reference dataset for the statistical analysis of sequence and structure features of proteins in the PDB. Regularly updated versions of Nh3D and the corresponding PDB-formatted coordinate sets are accessible from our Web site http://www.schematikon.org.

Original language: English (US)
Article number: 12
Journal: BMC Structural Biology
Volume: 5
DOI: 10.1186/1472-6807-5-12
State: Published - Jul 12 2005
Externally published: Yes

Fingerprint

Proteins
Databases
Sequence Analysis
Datasets
Protein Domains

ASJC Scopus subject areas

  • Structural Biology

Cite this

Nh3D: A reference dataset of non-homologous protein structures. / Thiruv, B.; Quon, Gerald; Saldanha, S. A.; Steipe, B.

In: BMC Structural Biology, Vol. 5, 12, 12.07.2005.

@article{c8670bcd692c4fc58e22e330871c37b2,
title = "Nh3D: A reference dataset of non-homologous protein structures",
abstract = "Background: The statistical analysis of protein structures requires datasets in which structural features can be considered independently distributed, i.e. not related through common ancestry, and that fulfil minimal requirements regarding the experimental quality of the structures it contains. However, non-redundant datasets based on sequence similarity invariably contain distantly related homologues. Here we provide a reference dataset of non-homologous protein domains, assuming that structural dissimilarity at the topology level is incompatible with recognizable common ancestry. The dataset is based on domains at the Topology level of the CATH database which hierarchically classifies all protein structures. It contains the best refined representatives of each Topology level, validates structural dissimilarity and removes internally duplicated fragments. The compilation of Nh3D is fully scripted. Results: The current Nh3D list contains 570 domains with a total of 90780 residues. It covers more than 70{\%} of folds at the Topology level of the CATH database and represents more than 90{\%} of the structures in the PDB that have been classified by CATH. We observe that even though all protein pairs are structurally dissimilar, some pairwise sequence identities after global alignment are greater than 30{\%}. Conclusion: Nh3D is freely available as a reference dataset for the statistical analysis of sequence and structure features of proteins in the PDB. Regularly updated versions of Nh3D and the corresponding PDB-formatted coordinate sets are accessible from our Web site http://www.schematikon.org.",
author = "B. Thiruv and Gerald Quon and Saldanha, {S. A.} and B. Steipe",
year = "2005",
month = "7",
day = "12",
doi = "10.1186/1472-6807-5-12",
language = "English (US)",
volume = "5",
journal = "BMC Structural Biology",
issn = "1472-6807",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Nh3D

T2 - A reference dataset of non-homologous protein structures

AU - Thiruv, B.

AU - Quon, Gerald

AU - Saldanha, S. A.

AU - Steipe, B.

PY - 2005/7/12

Y1 - 2005/7/12

AB - Background: The statistical analysis of protein structures requires datasets in which structural features can be considered independently distributed, i.e. not related through common ancestry, and that fulfil minimal requirements regarding the experimental quality of the structures it contains. However, non-redundant datasets based on sequence similarity invariably contain distantly related homologues. Here we provide a reference dataset of non-homologous protein domains, assuming that structural dissimilarity at the topology level is incompatible with recognizable common ancestry. The dataset is based on domains at the Topology level of the CATH database which hierarchically classifies all protein structures. It contains the best refined representatives of each Topology level, validates structural dissimilarity and removes internally duplicated fragments. The compilation of Nh3D is fully scripted. Results: The current Nh3D list contains 570 domains with a total of 90780 residues. It covers more than 70% of folds at the Topology level of the CATH database and represents more than 90% of the structures in the PDB that have been classified by CATH. We observe that even though all protein pairs are structurally dissimilar, some pairwise sequence identities after global alignment are greater than 30%. Conclusion: Nh3D is freely available as a reference dataset for the statistical analysis of sequence and structure features of proteins in the PDB. Regularly updated versions of Nh3D and the corresponding PDB-formatted coordinate sets are accessible from our Web site http://www.schematikon.org.

UR - http://www.scopus.com/inward/record.url?scp=23944502586&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=23944502586&partnerID=8YFLogxK

U2 - 10.1186/1472-6807-5-12

DO - 10.1186/1472-6807-5-12

M3 - Article

C2 - 16011803

AN - SCOPUS:23944502586

VL - 5

JO - BMC Structural Biology

JF - BMC Structural Biology

SN - 1472-6807

M1 - 12

ER -