COMPARISON OF PROTEIN SEQUENCES & STRUCTURES

$364,198 | R01 | FY2003 | LM | NIH

University Of Virginia Charlottesville, Charlottesville VA

Investigators

Linked publications & trials

Paper 28902397, Paper 27010337, Paper 25976240, Paper 24509512, Paper 23995390, Paper 23749753, Paper 22539666, Paper 22058127, Paper 20693322, Paper 20307279, Paper 20064877, Paper 19948773, Paper 15919194, Paper 14751975, Paper 12427470, Paper 12424122, Paper 12096113, Paper 10980156

Abstract

DESCRIPTION (adapted from the Abstract): The long-term goal of our research is to develop better methods for identifying distantly related protein and DNA sequences, and to use our ability to detect distant homologies to explore the duplication, fusion, and other processes responsible for increases in protein diversity. Although similarity searching is now a routine first step in the characterization of newly determined sequences, we believe that additional improvements in similarity searching methods will allow investigators to look back more deeply in evolutionary time. Moreover, as complete protein sequence sets become available for more organisms, sequence information can be exploited more effectively for functional genomics and traditional biochemical problems. The availability of complete genome sequences, combined with reliable and sensitive sequence comparison algorithms, also allows us to test hypotheses about the possible emergence of novel proteins over the past 200-1,200 million years. Over the next five years, our specific aims are:
(1) To extend the average look-back time provided by protein sequence similarity searching. We propose improvements to the scoring methods and to the statistical analysis of similarity scores that seek to push back the protein similarity-search horizon by 1.5- to 2-fold, to more than 2,000 million years for most protein families.
(2) To develop a higher performance, more flexible and user-friendly FASTA package.
(3) To study repeated domains in proteins. We will develop more quantitative methods for identifying both simple sequence and long-period repeats in proteins. We will characterize the fraction of repeat-containing proteins in proteomes, characterize the fraction of domain-structured proteins that are not internally repetitive, and ask whether these proteins duplicate or diverge with patterns that differ from "normal" single domain proteins.
(4) To explore genome-scale protein evolution and to identify potential "novel" or "young" protein families or domains. Over the next 2-4 years, more than six genomes that have diverged in the last 400 million years (an evolutionary distance sufficiently short that we should be able to identify all protein homologs) will become available. We will compare complete genomes, searching for newly emergent sequences.
(5) To develop and characterize unified methods for the simultaneous construction of alignments and phylogenies over multiple sequences. We will also develop standalone tree-based alignment heuristics capable of rapidly aligning large numbers of sequences.
