COMPUTER ANALYSIS OF AMINO ACID AND NUCLEOTIDE SEQUENCES

$0 | Z01 | FY2001 | LM | NIH

National Library Of Medicine

Investigators

Linked publications & trials

Paper 16218944 Paper 14663142 Paper 12804090 Paper 11599029 Paper 10642881

Abstract

The goal of this project is to use computational methods to define and analyze segments of protein and nucleotide sequences that show compositional bias (low-complexity regions or domains), and to understand their structural, functional and evolutionary significance, as well as their role in pathology. In protein sequences, these regions account for a large proportion of genome-encoded amino acids: approximately 25% in most eukaryotes, and most translated protein sequences contain at least one such region. They may contain homopolymeric tracts, mosaics of a few amino acids, or repeated patterns, which are frequently subtle and include those typical of many non-globular domains. New mathematical definitions and algorithms continue to be developed to identify low-complexity segments without bias, and to discover and analyze the properties of these regions that are relevant to their structures, interactions and biological functions. Interspersed low-complexity sequences are particularly abundant in many eukaryotic proteins that are crucial in morphogenesis and embryonic development, RNA processing, transcriptional regulation, signal transduction, and aspects of cellular and extracellular structural integrity. Structural data indicate that low-complexity segments of proteins are generally non-globular or conformationally mobile. However, knowledge of the molecular structures and dynamics of these domains is still very limited, because they are generally intractable to investigation by crystallography and NMR and account for less than 1% of the residues in current structural databases. Hence, mathematically rigorous sequence analysis provides a primary methodology for gaining insights into their biology and for raising questions to be investigated experimentally. For both nucleotide and amino acid sequences, these methods are also valuable for detecting and eliminating some artifacts in sequence database searches and alignment analysis.
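As a purely illustrative aside, the idea of flagging compositionally biased segments can be sketched with a sliding-window Shannon entropy measure. This is a minimal sketch under stated assumptions, not the project's own definitions or algorithms: the function names, the window length of 12 and the entropy threshold of 2.2 bits are hypothetical choices made for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy, in bits per residue, of the composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_regions(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    The window length and threshold are illustrative assumptions, not values
    taken from the abstract.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            start, end = i, i + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical sequence: a homopolymeric glutamine tract between diverse flanks.
    example = "ACDEFGHIKLMN" + "Q" * 20 + "NMLKIHGFEDCA"
    for start, end in low_complexity_regions(example):
        print(start, end, example[start:end])
```

A fixed window with a single threshold is the simplest possible choice. It catches homopolymeric tracts and mosaics of a few residues, but it would likely miss the subtler repeated patterns the abstract mentions, which is one reason more careful definitions are needed.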
