Statistics of Sequence Comparison

$228,773 · ZIA · FY2014 · LM · NIH

National Library of Medicine

Investigators

Linked publications & trials

Paper 30596639 Paper 29336305 Paper 28771374 Paper 28002465 Paper 27192614 Paper 25294922 Paper 23294268 Paper 21998158 Paper 21702692 Paper 21702690 Paper 21128852 Paper 20657661 Paper 17068079 Paper 15509610 Paper 14663142 Paper 11139604

Abstract

The primary focus of the project this year has been adapting sequence logos, as originally described by Tom Schneider, to the log-odds formalism. Sequence logos represent DNA or protein motifs captured by an input multiple alignment. A logo represents each position in the multiple alignment by a stack of letters whose individual heights are proportional to the observed letter frequencies, and whose aggregate height is proportional to the information at that position. Originally, information was defined as an entropy difference, optionally with a correction for small sample size. Alternative definitions of information have also been proposed and implemented, including the relative entropy implied by the observed letter frequencies and the background frequencies attributable to chance. It has long been recognized that any scoring system for local alignments, i.e. a system with negative expected score, is of the log-odds form log(Q/P), with implicit if not explicit target frequencies Q. These are the frequencies of aligned letters, among related sequences, that the scoring system is optimized to distinguish from chance, modeled by the probability P. For pairwise alignments, all popular local alignment scoring systems are constructed by explicitly specifying appropriate target frequencies Q. It was only in 2010, however, that a method for explicitly estimating target frequencies for multiple alignment columns was described in the bioinformatics literature. This project has sought to compare the resulting multiple-alignment log-odds scores to previously proposed scores, by the criterion of their effectiveness in recognizing biologically important positions, and to make log-odds scores available to researchers through a public sequence-logo construction program. The frequencies Q for log-odds scores may be constructed in a variety of ways.
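As an illustration of the two older definitions of information mentioned above, the following minimal sketch computes the entropy-difference and relative-entropy information of a single DNA alignment column, and the resulting letter heights for a logo stack. The function names and the uniform background are assumptions made for this example, not part of the project's software, and the small-sample-size correction is omitted.

```python
import math
from collections import Counter

DNA = "ACGT"  # assumed alphabet for this illustration

def column_frequencies(column, alphabet=DNA):
    """Observed letter frequencies in one alignment column.

    Characters outside the alphabet (e.g. gaps) are ignored; the column
    must contain at least one alphabet letter.
    """
    counts = Counter(column)
    n = sum(counts[a] for a in alphabet)
    return {a: counts[a] / n for a in alphabet}

def entropy_difference(freqs):
    """Information in bits: maximum entropy minus observed Shannon entropy."""
    observed = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(len(freqs)) - observed

def relative_entropy(freqs, background):
    """Information in bits: relative entropy of observed vs. background frequencies."""
    return sum(f * math.log2(f / background[a]) for a, f in freqs.items() if f > 0)

def letter_heights(freqs, information):
    """Height of each letter: its frequency times the column's information."""
    return {a: f * information for a, f in freqs.items()}
```

With a uniform background the two definitions coincide; for the column "AACC" both give 1 bit, split equally between A and C. They differ once the background frequencies are non-uniform.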
Among the simplest constructions of Q is normalized maximum likelihood (NML), in which Q is taken to be proportional to the likelihood of a column implied by a maximum-likelihood multinomial. It can be shown that NML log-odds scores are equivalent to relative-entropy scores, plus a correction term c(N) dependent on the number of observations N in the multiple alignment column. For DNA and protein multiple alignments, we have calculated c(N) explicitly for small N, and have derived an asymptotic formula of sufficient accuracy to be used for N where explicitly calculating c(N) becomes infeasible. Although NML log-odds scores may be appropriate for DNA, in the protein alignment context they ignore prior knowledge concerning amino acid relationships. The alternative log-odds BILD scores, first described in 2010, are able to exploit this knowledge through their use of a Bayesian Dirichlet mixture prior describing multiple alignment columns of related protein sequences. It is notable that BILD scores are essentially equivalent to NML scores when uninformative Jeffreys priors are used. Using an enzyme dataset, with active sites as a proxy for biologically important positions within proteins, we compared small-sample-size corrected and uncorrected entropy-difference scores to NML scores and to BILD scores using a recently developed Dirichlet mixture prior. In this comparison, log-odds scores proved superior to previously proposed multiple alignment scoring systems. Taking account of prior knowledge concerning amino acid relationships tended to raise the scores of all alignment positions, slightly favoring important positions. We implemented an online server to produce log-odds sequence logos. A paper describing this work has been submitted for publication. A secondary focus of this project has been preliminary work on applying BILD scores to the automatic articulation of protein subfamilies.
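The two log-odds constructions described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation. For NML, the normalizer is computed by brute-force enumeration of all count vectors, which is only practical for small N and small alphabets. The sign and scaling of the correction term here (the log of the normalizer, subtracted from N times the relative entropy) are this sketch's convention and may differ from the c(N) of the abstract. For the Dirichlet-mixture score, the standard Dirichlet-multinomial marginal likelihood is used, and the mixture weights and parameters passed in are hypothetical placeholders rather than the published prior.

```python
import math

def compositions(n, k):
    """Yield every tuple of k non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def max_likelihood(counts):
    """Probability of one ordered column under its own maximum-likelihood multinomial."""
    n = sum(counts)
    return math.prod((c / n) ** c for c in counts if c > 0)

def multinomial_coefficient(counts):
    """Number of ordered columns that share the given letter counts."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)
    return coeff

def nml_normalizer(n, k):
    """Sum of maximum likelihoods over all ordered columns of n letters from k types."""
    return sum(multinomial_coefficient(c) * max_likelihood(c)
               for c in compositions(n, k))

def log2_background(counts, background):
    """log2 probability of the ordered column under the chance model P."""
    return sum(c * math.log2(p) for c, p in zip(counts, background) if c > 0)

def nml_log_odds(counts, background):
    """NML log-odds score log2(Q/P) in bits for one column."""
    log_q = (math.log2(max_likelihood(counts))
             - math.log2(nml_normalizer(sum(counts), len(counts))))
    return log_q - log2_background(counts, background)

def dirichlet_mixture_log_odds(counts, background, weights, alphas):
    """Log-odds score in bits with Q given by a Dirichlet-mixture marginal likelihood.

    weights: mixture weights summing to 1 (hypothetical values in this sketch)
    alphas:  one Dirichlet parameter vector per mixture component
    """
    n = sum(counts)
    q = 0.0
    for w, alpha in zip(weights, alphas):
        total = sum(alpha)
        log_term = math.lgamma(total) - math.lgamma(total + n)
        log_term += sum(math.lgamma(a + c) - math.lgamma(a)
                        for a, c in zip(alpha, counts))
        q += w * math.exp(log_term)
    return math.log2(q) - log2_background(counts, background)
```

For example, `nml_log_odds((2, 0, 0, 0), (0.25,) * 4)` equals 4 bits of total relative entropy minus log2(7), since the normalizer for two observations over four letters is 7. A single-component mixture with all parameters set to 0.5 corresponds to a Jeffreys prior. Because the brute-force normalizer grows quickly with N and alphabet size, an asymptotic formula of the kind the abstract mentions would be needed for protein columns of realistic depth.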
