Detecting Homology in the "Twilight Zone" of Sequence Similarity

  • Patterson, Randen L (PI)

Project: Research project

Project Details


DESCRIPTION (provided by applicant): The "protein problem" has remained unsolved despite decades of research [1, 2]. In principle, the primary amino acid sequence of a protein is expected to determine its structure, function, and evolutionary (SF&E) characteristics. Yet there is still no reliable method for predicting the native-state structure of a protein and its function from sequence alone. Inferring the evolutionary relationships among highly divergent protein sequences is similarly daunting. In general, when pairwise alignments between protein sequences fall below 25% identity, statistical measurements do not provide support robust enough to identify clear phylogenetic relationships, despite intensive research in this area [1, 3, 4]. The recent explosion in the availability of knowledge bases and computational techniques for the analysis of complex data has created an unprecedented opportunity to extract information from protein sequences. Starting from the basic premise that protein sequence encodes information about SF&E, we developed a unified framework for inferring SF&E from sequence using a knowledge-based approach, in which we measure the similarity between a query sequence and a set of biologically relevant profiles in an unbiased manner. Results from this Gestalt Domain Detection Algorithm-Basic Local Alignment Tool (GDDA-BLAST) provide phylogenetic profiles that have the capacity to model the SF&E relationships of various proteins. Indeed, GDDA-BLAST is capable of deriving deep phylogenetic relationships for highly divergent proteins in a quantifiable manner [5, 6].
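As an illustration only, the general idea of scoring each query against a fixed set of profiles and then comparing the resulting score vectors can be sketched as follows. This is not the GDDA-BLAST implementation: the scoring function (a shared 3-mer Jaccard index), the Euclidean distance, the toy sequences, and every function name here are assumptions chosen to keep the sketch self-contained.

```python
import math

def kmer_score(query, profile_seq, k=3):
    """Toy stand-in for a query-vs-profile similarity score (assumption):
    Jaccard index of the k-mer sets of the two sequences."""
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    p = {profile_seq[i:i + k] for i in range(len(profile_seq) - k + 1)}
    if not q or not p:
        return 0.0
    return len(q & p) / len(q | p)

def fingerprint(query, profiles, score=kmer_score):
    """Score one query against every profile, giving a fixed-length vector."""
    return [score(query, p) for p in profiles]

def euclidean(u, v):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(fingerprints):
    """Pairwise distances between named fingerprints, keyed by (name, name)."""
    names = list(fingerprints)
    return {(a, b): euclidean(fingerprints[a], fingerprints[b])
            for a in names for b in names}

# Hypothetical profiles and query sequences, invented for this example.
profiles = ["MKVLAAGIV", "GDDTRYPLE", "HHCCWWKKR"]
sequences = {
    "a": "MKVLAAGIVGDD",
    "b": "MKVLAAGLVGDD",
    "c": "HHCCWWKKRPLE",
}

fingerprints = {name: fingerprint(seq, profiles) for name, seq in sequences.items()}
dist = distance_matrix(fingerprints)

for (x, y), d in sorted(dist.items()):
    if x < y:
        print(f"{x}-{y}: {d:.3f}")
```

Because every sequence is reduced to a vector over the same profile set, the distances are computed without aligning the query sequences to one another; a distance matrix of this kind could then be passed to a standard tree-building method.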
Preliminary results from our computational case study of the highly divergent family of retroelements accord with those previously reported, and demonstrate that GDDA-BLAST measurements can be treated as "fingerprints" from which distance estimates, and hence phylogenetic relationships, can be derived without prior information, multiple sequence alignment, or manual editing. We propose that sequence information present within the "twilight zone" of sequence similarity can provide key insight into SF&E relationships among distantly related and/or rapidly evolving proteins. This proposal aims to push the limits of homology detection within the "twilight zone" by evaluating and optimizing GDDA-BLAST performance on benchmark and experimental data sets. Armed with these refined GDDA-BLAST measurements, we propose to conduct a comprehensive, ab initio phylogenetic study of retroelements and RNA-dependent RNA polymerases from the positive-strand family of RNA viruses (+ssRNA). Simultaneously, we will derive high-resolution maps of domain boundaries and empirically validate functional annotations and predictions of key residues for those activities. This work aims to translate research from the computer to the laboratory benchtop. We expect that the tools and resources generated from this grant will be accessible and user-friendly to the bench scientist, thereby speeding the discovery process of other clinically relevant research endeavors.

PUBLIC HEALTH RELEVANCE: The long-term implication of this proposal is the development of a unified framework for high-resolution and simultaneous measurements of structure, function, and evolution. Should this be possible: (i) functional and evolutionary measurements could quantitatively inform structural modeling to derive accurate atomic-resolution protein structures, (ii) structural and functional measurements could inform evolutionary histories to derive accurate evolutionary rates, deep-branch relationships, and homologous spaces within each protein, and (iii) structural and evolutionary measurements could indicate the location of functionalities contained within any protein and the regulatory elements that control these functions. Armed with this information, the speed at which diseases could be understood, and pharmacophores/therapies developed to combat them, would likely increase dramatically.
Effective start/end date: 4/10/09 to 3/31/14


  • National Institutes of Health: $232,788.00
  • National Institutes of Health: $141,578.00
  • National Institutes of Health: $132,732.00
  • National Institutes of Health: $265,987.00
  • National Institutes of Health: $232,472.00


  • Medicine(all)
  • Biochemistry, Genetics and Molecular Biology(all)

