FINDING PROTEIN SEQUENCE MOTIFS--METHODS AND APPLICATION

$0 | Z01 | FY2001 | LM | NIH

National Library of Medicine

Abstract

In the last few years, the rapid accumulation of genome sequences and protein structures has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI formed the basis of our work on protein motif analysis. We developed a new mode of PSI-BLAST application, implemented as an automatic procedure, that searches the database exhaustively by repeating PSI-BLAST iterations to convergence, using newly identified protein family members as queries. Two other new procedures, IMPALA and RPS-BLAST, allow one to search a library of protein family profiles using an individual protein sequence as the query. The BLAST-CLUST procedure was developed to flexibly cluster proteins by sequence similarity, taking BLAST search outputs as its input. These methods were applied to a systematic survey of completely sequenced genomes and used to produce a census of protein structural folds. A theoretical study predicting the total number of protein folds and families produced estimates of approximately 1000 folds and approximately 5000 families. The evolutionary history and phyletic distribution of several types of protein domains were analyzed in detail, including different types of Holliday junction resolvases, two families of NTPases, and novel classes of proteases and nucleic-acid-binding domains.
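As an illustration of clustering proteins from pairwise search output, the following is a minimal sketch and not the BLAST-CLUST implementation itself. It assumes single-linkage clustering, and it assumes each hit is reduced to a tuple of query ID, subject ID, percent identity, and alignment coverage. The function name `cluster_by_similarity`, the tuple layout, and the default thresholds are hypothetical choices made for this example.

```python
from collections import defaultdict


def cluster_by_similarity(hits, min_identity=30.0, min_coverage=0.9):
    """Single-linkage clustering of sequence IDs from pairwise hits.

    hits: iterable of (query_id, subject_id, percent_identity, coverage).
    The thresholds are illustrative assumptions, not published defaults.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own cluster root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, subject, identity, coverage in hits:
        # Make sure both sequences appear in the result, even as singletons.
        find(query)
        find(subject)
        if identity >= min_identity and coverage >= min_coverage:
            union(query, subject)

    clusters = defaultdict(set)
    for seq in list(parent):
        clusters[find(seq)].add(seq)

    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Made-up hits for illustration only.
example_hits = [
    ("A", "B", 45.0, 0.95),
    ("B", "C", 38.0, 0.92),
    ("C", "D", 25.0, 0.95),   # below the identity threshold, so D stays apart
    ("E", "E", 100.0, 1.0),   # self-hit, yields a singleton cluster
]
clusters = cluster_by_similarity(example_hits)
```

With single linkage, A and C end up in the same cluster through B even though no direct A-to-C hit is listed. This makes the grouping sensitive to the chosen thresholds, which is one reason a flexible, parameterized procedure is useful.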
