Towards improved assessment of functional similarity in large-scale screens: A study on indel length

Alexander Schönhuth, Raheleh Salari, Fereydoun Hormozdiari, Artem Cherkasov, S. Cenk Sahinalp

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Although insertions and deletions are a common type of evolutionary sequence variation, their origins and their functional consequences are not yet comprehensively understood. Most alignment algorithms and programs only roughly reflect the evolutionary processes that result in gaps, which typically require further evaluation. Interestingly, it is widely believed that gaps are the predominant form of sequence variation resulting in structural and functional changes. Thus, when assessing the functional similarity of proteins based on computational alignments, it is desirable to distinguish between gaps that reflect true point mutations and gaps that are alignment artifacts. Here we introduce pair hidden Markov model-based solutions to rapidly assess the statistical significance of gaps in alignments resulting from classical Needleman-Wunsch-like alignment procedures that implement affine gap penalty scoring schemes. Surprisingly, although it has a natural formulation, the emanating Markov chain problem had no known efficient solution thus far. In this article, we present the first efficient algorithm to solve it. We demonstrate that, when comparing paralogous protein pairs (from Escherichia coli) of equal alignment identity and similarity, alignments that contain gaps of significant length are significantly less similar in terms of functionality, as measured with respect to Gene Ontology (GO) term similarity. This demonstrates for the first time, in a formally sound manner, that insertions and deletions cause more severe functional changes between proteins than substitutions. Our method can be reliably employed to quickly filter alignment outputs for protein pairs that are more likely to be functionally similar and/or divergent, and it establishes a sound and useful add-on for large-scale alignment studies.
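To make the idea of "statistical significance of gap length" concrete, the following is a minimal sketch under strong simplifying assumptions; it is not the algorithm presented in the article. It assumes the columns of an alignment follow a two-state Markov chain (match or gap), with a gap-open probability `p_open` and a gap-extension probability `p_extend` loosely playing the role of affine gap-open and gap-extend penalties. The function name, parameter names, and the numeric values in the usage lines are all hypothetical. The sketch computes the probability that an alignment of `n` columns contains at least one gap run of length `k` or more, which could serve as a p-value for an observed gap length under this toy null model.

```python
def prob_gap_run_at_least(n, k, p_open, p_extend):
    """Probability that n alignment columns contain a gap run of length >= k.

    Toy null model (an assumption, not the article's pair HMM):
    columns form a two-state Markov chain; a match column is followed by a
    gap column with probability p_open, and a gap column is followed by
    another gap column with probability p_extend. The chain is started as
    if a match column preceded the alignment.
    """
    if k < 1 or n < 0:
        raise ValueError("need k >= 1 and n >= 0")
    if not (0.0 <= p_open <= 1.0 and 0.0 <= p_extend <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")

    # alive[j]: probability that the current gap run has length j
    # (j = 0 means the last column was a match) and that no run of
    # length >= k has occurred so far.
    alive = [0.0] * k
    alive[0] = 1.0
    hit = 0.0  # probability that a run of length >= k has already occurred

    for _ in range(n):
        nxt = [0.0] * k
        for j in range(k):
            # Probability that the next column is a gap column.
            p_gap = p_open if j == 0 else p_extend
            # Next column is a match: the current run ends.
            nxt[0] += alive[j] * (1.0 - p_gap)
            # Next column is a gap: the run grows by one.
            if j + 1 == k:
                hit += alive[j] * p_gap  # run reaches length k: absorb
            else:
                nxt[j + 1] += alive[j] * p_gap
        alive = nxt
    return hit


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    p_value = prob_gap_run_at_least(n=300, k=12, p_open=0.02, p_extend=0.6)
    print(f"P(gap run >= 12 in 300 columns) = {p_value:.4g}")
```

The dynamic program runs in O(n·k) time and O(k) memory. A small returned probability would suggest that a gap of the observed length is unlikely under the assumed null model; how the transition probabilities relate to a concrete affine scoring scheme is left open here.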

Original language: English (US)
Pages (from-to): 1-20
Number of pages: 20
Journal: Journal of Computational Biology
Volume: 17
Issue number: 1
DOI: 10.1089/cmb.2009.0031
State: Published - Jan 1 2010
Externally published: Yes

Fingerprint

Alignment
Markov Chains
Gene Ontology
Proteins
Escherichia coli Proteins
Point Mutation
Artifacts
Protein
Deletion
Insertion
Acoustic waves
Similarity
Statistical Significance
Hidden Markov models
Efficient Solution
Scoring
Markov processes
Escherichia coli
Markov Model

Keywords

  • Algorithms
  • Biochemical networks
  • Computational molecular biology
  • Gene expression
  • HMM
  • Machine learning
  • RNA
  • Secondary structure

ASJC Scopus subject areas

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Cite this

Towards improved assessment of functional similarity in large-scale screens: A study on indel length. / Schönhuth, Alexander; Salari, Raheleh; Hormozdiari, Fereydoun; Cherkasov, Artem; Sahinalp, S. Cenk.

In: Journal of Computational Biology, Vol. 17, No. 1, 01.01.2010, p. 1-20.


@article{9c5bc03d55c64a47b7f896d74e29791a,
title = "Towards improved assessment of functional similarity in large-scale screens: A study on indel length",
keywords = "Algorithms, Biochemical networks, Computational molecular biology, Gene expression, HMM, Machine learning, RNA, Secondary structure",
author = "Alexander Sch{\"o}nhuth and Raheleh Salari and Fereydoun Hormozdiari and Artem Cherkasov and Sahinalp, {S. Cenk}",
year = "2010",
month = "1",
day = "1",
doi = "10.1089/cmb.2009.0031",
language = "English (US)",
volume = "17",
pages = "1--20",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "1",

}

TY - JOUR
T1 - Towards improved assessment of functional similarity in large-scale screens
T2 - A study on indel length
AU - Schönhuth, Alexander
AU - Salari, Raheleh
AU - Hormozdiari, Fereydoun
AU - Cherkasov, Artem
AU - Sahinalp, S. Cenk
PY - 2010/1/1
Y1 - 2010/1/1
KW - Algorithms
KW - Biochemical networks
KW - Computational molecular biology
KW - Gene expression
KW - HMM
KW - Machine learning
KW - RNA
KW - Secondary structure
UR - http://www.scopus.com/inward/record.url?scp=77149139160&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77149139160&partnerID=8YFLogxK
U2 - 10.1089/cmb.2009.0031
DO - 10.1089/cmb.2009.0031
M3 - Article
C2 - 20078394
AN - SCOPUS:77149139160
VL - 17
SP - 1
EP - 20
JO - Journal of Computational Biology
JF - Journal of Computational Biology
SN - 1066-5277
IS - 1
ER -