Statistics of Sequence Comparison

$391,740 | ZIA | FY2010 | LM | NIH

National Library of Medicine

Linked publications & trials

Paper 30596639 Paper 29336305 Paper 28771374 Paper 28002465 Paper 27192614 Paper 25294922 Paper 23294268 Paper 21998158 Paper 21702692 Paper 21702690 Paper 21128852 Paper 20657661 Paper 17068079 Paper 15509610 Paper 14663142 Paper 11139604

Abstract

Work this year focused on scoring systems for multiple alignment. Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is the selection of a set of substitution scores for aligned amino acids or nucleotides. Substitution scores for local pairwise alignment are implicitly of log-odds form, comparing the probabilities of aligning two letters under models of relatedness and non-relatedness, and the best pairwise substitution scores are explicitly constructed in this way. We have developed ideas, based on the minimum description length principle, for extending this formalism to multiple alignments. Most simply, Bayesian methods can be used to derive "BILD" substitution scores from prior distributions describing columns of related letters. This approach has previously been used only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We have employed BILD scores in Gibbs sampling optimization procedures, and shown that they yield improved performance in constructing biologically accurate alignments. We have developed pilot programs for constructing gapped multiple alignments using BILD scores. Using artificial sequences, we have shown that these programs outperform earlier programs at detecting domain boundaries. We have also applied them to the recognition and annotation of DNA-binding domains in Apicomplexan proteins.

In related work, we have studied Dirichlet mixture models in the context of non-standard sequence composition. A Dirichlet mixture with M components over an alphabet of L letters has M*(L+1)-1 free parameters: L for each component, plus M-1 independent mixture weights. If M = L/2, this is exactly as many as a symmetric pairwise substitution matrix has, namely L*(L+1)/2-1. While each Dirichlet mixture implies a unique such matrix, we have shown that multiple mixtures can map to the same matrix, and that some substitution matrices may not correspond to any Dirichlet mixture.
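The parameter counts above, and the map from a mixture to a symmetric matrix, can be illustrated with a short Python sketch. It rests on an assumption that is not spelled out in this record: that the matrix implied by a mixture is the set of pair frequencies q(i,j) obtained by drawing a letter distribution from the mixture and then drawing two letters independently from it. The function names and the toy four-letter mixture are hypothetical and chosen only for illustration.

```python
import numpy as np

def mixture_free_parameters(M, L):
    """M components with L Dirichlet parameters each, plus M-1 free weights."""
    return M * (L + 1) - 1

def symmetric_matrix_free_parameters(L):
    """Symmetric L x L pair frequencies constrained to sum to 1."""
    return L * (L + 1) // 2 - 1

def implied_pair_frequencies(weights, alphas):
    """Pair frequencies q[i, j] implied by a Dirichlet mixture (assumed form).

    For one Dirichlet with parameters a and A = sum(a):
      E[p_i p_j] = a_i a_j / (A (A + 1))        for i != j
      E[p_i^2]   = a_i (a_i + 1) / (A (A + 1))
    The mixture value is the weighted sum over components.
    """
    weights = np.asarray(weights, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    L = alphas.shape[1]
    q = np.zeros((L, L))
    for w, a in zip(weights, alphas):
        A = a.sum()
        q += w * (np.outer(a, a) + np.diag(a)) / (A * (A + 1.0))
    return q

def log_odds_scores(q):
    """Log-odds scores in bits: log2(q_ij / (p_i p_j)), p being the marginals of q."""
    p = q.sum(axis=1)
    return np.log2(q / np.outer(p, p))

# Protein alphabet: L = 20 letters, M = L/2 = 10 components.
print(mixture_free_parameters(10, 20), symmetric_matrix_free_parameters(20))

# Toy mixture over a 4-letter alphabet with M = 2 components (made-up values).
toy_weights = [0.5, 0.5]
toy_alphas = [[2.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 2.0]]
toy_q = implied_pair_frequencies(toy_weights, toy_alphas)
toy_scores = log_odds_scores(toy_q)
```

Under this reading, the matrix depends on the mixture only through its weighted second moments, which is one way to see why distinct mixtures could share a matrix.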
A Dirichlet mixture for protein sequence analysis is generally constructed from a particular set of proteins, and so implies a particular background amino acid composition. Such a mixture is therefore likely to be suboptimal for comparing proteins whose composition differs significantly from that background. We have described a sensible and efficient method for adjusting the parameters of a Dirichlet mixture so that they are consistent with any specified composition.
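As a rough illustration of what "implied background composition" means, the sketch below computes the composition of a mixture as the weighted average of its component means and measures how far it lies from a target composition. It only diagnoses a mismatch; it does not implement the adjustment method mentioned above, whose details are not given in this record. The function names and the four-letter toy values are hypothetical.

```python
import numpy as np

def implied_composition(weights, alphas):
    """Background composition implied by a Dirichlet mixture.

    Each component k has mean alphas[k] / sum(alphas[k]); the mixture's
    composition is the weighted average of these means.
    """
    weights = np.asarray(weights, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    means = alphas / alphas.sum(axis=1, keepdims=True)
    return weights @ means

def relative_entropy_bits(p, q):
    """Relative entropy D(p || q) in bits; terms with p_i = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Toy mixture over a 4-letter alphabet (made-up values).
toy_weights = [0.5, 0.5]
toy_alphas = [[2.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 2.0]]
background = implied_composition(toy_weights, toy_alphas)

# A hypothetical target composition that differs from the implied one.
target = np.array([0.4, 0.1, 0.1, 0.4])
mismatch = relative_entropy_bits(target, background)
```

A mismatch of zero would mean the mixture is already consistent with the target composition; larger values indicate that some adjustment of the mixture parameters is needed.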
