Statistics Of Sequence Comparison

$0 | Z01 | FY2004 | LM | NIH

National Library Of Medicine

Linked publications & trials

Paper 18586708, Paper 17068079, Paper 15509610, Paper 14663142, Paper 11139604

Abstract

This project is a continuing study of what similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A subsidiary and related question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year has focused primarily on the continued development and implementation of a method to transform an amino acid substitution matrix for use in the comparison of proteins having non-standard amino acid compositions. We have shown that, for a variety of sets of related protein pairs in which at least one member of each pair has biased composition, the new matrices improve the performance of pairwise alignment algorithms with respect to both bit score and alignment accuracy. We previously showed that, except for certain degenerate cases, a substitution matrix, viewed as a log-odds matrix of target and background frequencies, can be valid, in the sense of having consistent target and background frequencies, only in a unique context. This year we described in detail which matrices are degenerate and how they may be recognized. We also described the mathematical details of how to calculate the target and background frequencies, and the natural scale, implicit in any non-degenerate matrix, as well as the details of applying Lagrange multipliers in an efficient Newtonian procedure for transforming a standard matrix for use in a non-standard context. Elements of the resulting matrix always have the simple form of a constant times the original substitution score, plus amino-acid-specific scores for each of the two sequences compared.
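
As a rough illustration of the "natural scale" of a log-odds matrix, the Python sketch below solves the standard normalization condition sum_ij p_i p_j exp(lambda * s_ij) = 1 for lambda, then recovers the implied target frequencies q_ij = p_i p_j exp(lambda * s_ij). It assumes the background frequencies are supplied by the caller, which is a simplification: the abstract describes deriving them from the matrix itself. The function names (natural_scale, target_frequencies, adjusted_score), the bisection solver, and the toy 4-letter matrix are illustrative assumptions, not the project's implementation. In particular, adjusted_score only restates the form given in the abstract's last sentence; it does not compute the constant or the per-residue terms, which in the described method come from the Lagrange-multiplier procedure.

```python
import math


def natural_scale(scores, p, p2=None, iterations=200):
    """Return the positive lambda with sum_ij p[i]*p2[j]*exp(lambda*s[i][j]) = 1.

    Assumes (hypothetically) that background frequencies p and p2 are given.
    """
    p2 = p if p2 is None else p2
    pairs = [(p[i] * p2[j], scores[i][j])
             for i in range(len(p)) for j in range(len(p2))]

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    # A unique positive root exists only if the expected score is negative
    # and at least one score is positive.
    expected = sum(w * s for w, s in pairs)
    if expected >= 0 or max(s for _, s in pairs) <= 0:
        raise ValueError("need negative expected score and a positive score")

    # Bracket the root: f < 0 just above zero, f > 0 for large lambda.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:
        lo /= 2.0

    # Plain bisection; slower than Newton's method but simple and robust.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def target_frequencies(scores, p, lam, p2=None):
    """Target frequencies q[i][j] = p[i]*p2[j]*exp(lam*s[i][j])."""
    p2 = p if p2 is None else p2
    return [[p[i] * p2[j] * math.exp(lam * scores[i][j])
             for j in range(len(p2))] for i in range(len(p))]


def adjusted_score(s_ij, c, a_i, b_j):
    """Form of a transformed score: constant times the original score,
    plus one residue-specific term per sequence (values are inputs here)."""
    return c * s_ij + a_i + b_j


# Toy example: 4 letters, match = +1, mismatch = -1, uniform background.
toy_scores = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
toy_background = [0.25] * 4
toy_lambda = natural_scale(toy_scores, toy_background)
toy_targets = target_frequencies(toy_scores, toy_background, toy_lambda)
print(toy_lambda, sum(map(sum, toy_targets)))
```

For this toy matrix the condition reduces to 0.25 e^lambda + 0.75 e^(-lambda) = 1, whose positive solution is lambda = ln 3, and the recovered target frequencies sum to 1.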
