Statistics of Sequence Comparison

$260,305 · ZIA · FY2012 · LM · NIH

National Library Of Medicine

Linked publications & trials

Paper 30596639 Paper 29336305 Paper 28771374 Paper 28002465 Paper 27192614 Paper 25294922 Paper 23294268 Paper 21998158 Paper 21702692 Paper 21702690 Paper 21128852 Paper 20657661 Paper 17068079 Paper 15509610 Paper 14663142 Paper 11139604

Abstract

The primary focus this year was on the assessment of substitution scoring systems for aligning protein profiles to one another. Pairwise protein sequence alignments are generally evaluated using scores defined as the sum of substitution scores for aligning amino acids to one another, and gap scores for aligning runs of amino acids in one sequence to null characters inserted into the other. Protein profiles may be abstracted from multiple alignments of protein sequences, and substitution and gap scores have been generalized to the alignment of such profiles either to single sequences or to other profiles. Although there is widespread agreement on the general form substitution scores should take for profile-sequence alignment, little consensus has been reached on how best to construct profile-profile substitution scores, and a large number of these scoring systems have been proposed. We assessed a variety of such substitution scores, using several sets of gold standard multiple alignments. For our evaluation, we calculated the probability that a profile column yields a higher substitution score when aligned to a related column than when aligned to an unrelated one. We also considered the same measure applied to sets of two or three adjacent columns. This simple approach had two advantages: it did not depend primarily upon the gold standard alignment columns with the weakest empirical support, and it did not need to fit gap and offset costs for use with each substitution scoring system studied. No substitution scoring system emerged as superior in all our tests, but two showed consistently strong behavior: a generalization of profile-sequence scores similar to those used in the Compass alignment program, and the recently proposed Bayesian Integral Log-odds (BILD) scores.

A secondary focus was on issues related to the Dirichlet mixture model, which is used to analyze protein sequences. The Dirichlet mixture model was introduced to protein sequence analysis by Haussler's group at UCSC. In brief, this model imagines that a particular position in a protein family is described by a multinomial distribution on the set of amino acids. Although the multinomial for a particular position may be unique, the study of many protein families reveals that certain regions of multinomial space are much more heavily populated than others. This general knowledge may be summarized by a Dirichlet mixture prior, which is a probability density over multinomial space that lends itself to easy analysis. Our research on Dirichlet mixture priors this year centered on the question of how best to derive such priors from a set of multiple alignment data. Our previous work had applied the Minimum Description Length principle and a Gibbs sampling algorithm to this problem. Work begun this year applied the Dirichlet Process to the same problem; preliminary results suggest that this leads to much improved mixtures with many more components.
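The evaluation measure described above, the probability that a related column pair outscores an unrelated one, can be illustrated with a minimal Python sketch. The function name and inputs are hypothetical, and counting ties as one half is an assumption; the abstract does not say how the scores were sampled or how ties were treated.

```python
from itertools import product


def prob_related_scores_higher(related_scores, unrelated_scores):
    """Estimate P(related-pair score > unrelated-pair score).

    Both arguments are lists of substitution scores that some
    profile-profile scoring system assigned to column pairs.
    Ties count as 0.5 (an assumption, not stated in the abstract).
    """
    if not related_scores or not unrelated_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    # Compare every related score with every unrelated score.
    for r, u in product(related_scores, unrelated_scores):
        if r > u:
            wins += 1.0
        elif r == u:
            wins += 0.5
    return wins / (len(related_scores) * len(unrelated_scores))


if __name__ == "__main__":
    # Made-up toy scores, for illustration only.
    print(prob_related_scores_higher([3.0, 1.0], [2.0, 0.0]))
```

A value near 1 would mean the scoring system almost always ranks related columns above unrelated ones, while 0.5 would mean it does no better than chance.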
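To illustrate why a Dirichlet mixture prior "lends itself to easy analysis", the sketch below computes the standard posterior-mean estimate of a column's multinomial from observed counts under such a prior. It uses a toy two-letter alphabet and made-up mixture parameters; all names are hypothetical, and this is not the project's own code or its method for deriving priors.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, weights, alphas):
    """Posterior component weights and mean multinomial for one column.

    counts  : observed letter counts in the column
    weights : prior mixture weights, one per component (all > 0)
    alphas  : Dirichlet parameter vectors, one per component
    """
    # Posterior weight of component j is proportional to
    # weights[j] * B(alphas[j] + counts) / B(alphas[j]).
    log_post = []
    for q, alpha in zip(weights, alphas):
        updated = [a + n for a, n in zip(alpha, counts)]
        log_post.append(math.log(q) + log_beta(updated) - log_beta(alpha))
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    post = [u / total for u in unnorm]

    # Mix the per-component Dirichlet posterior means.
    n_total = sum(counts)
    probs = [
        sum(w * (counts[i] + alpha[i]) / (n_total + sum(alpha))
            for w, alpha in zip(post, alphas))
        for i in range(len(counts))
    ]
    return post, probs


if __name__ == "__main__":
    # Toy prior: one component favors letter 0, the other letter 1.
    post, probs = posterior_mean([5, 0], [0.5, 0.5], [[10, 1], [1, 10]])
    print(post, probs)
```

For protein data the alphabet would have 20 letters, and the mixture weights and parameter vectors would be estimated from multiple alignment data rather than chosen by hand.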
