Statistics Of Sequence Comparison

$0 Z01 FY2003 LM NIH

National Library Of Medicine

Investigators

Linked publications & trials

Paper 18586708 Paper 17068079 Paper 15509610 Paper 14663142 Paper 11139604

Abstract

This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A subsidiary and related question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year has focused primarily on the development and implementation of a method to transform an amino acid substitution matrix for use in comparing proteins with non-standard amino acid compositions. We have shown that a substitution matrix, viewed as a log-odds matrix of target and background frequencies, can be valid, in the sense of having consistent target and background frequencies, only in a unique context. Accordingly, all standard matrices, such as the PAM or BLOSUM series, are appropriately used only for the comparison of proteins with standard amino acid compositions. We have developed a rationale for transforming the target frequencies implicit in a standard matrix to new target frequencies that are consistent with non-standard background frequencies. Using Lagrange multipliers, we have developed an efficient Newtonian procedure for calculating these new target frequencies. We have compared compositionally adjusted BLOSUM-62 substitution matrices with standard BLOSUM-62 substitution matrices on four specially developed test sets involving sequences from organisms with biased nucleotide compositions, as well as biased proteomes. The new matrices show improved performance in both bit score and alignment accuracy.
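The adjustment described above can be illustrated with a minimal sketch under stated assumptions. It assumes that the new target frequencies should stay as close as possible, in relative entropy, to the standard ones while having marginals equal to the new background frequencies; the abstract does not state the objective, so this is an assumption. Under that assumption the solution has the form t_ij = a_i * t0_ij * b_j, where the scale factors a_i and b_j are exponentials of Lagrange multipliers. The sketch finds them by iterative proportional fitting, a simple stand-in for the Newtonian procedure mentioned in the abstract, not a reproduction of it. The three-letter alphabet, all frequencies, and the function names adjust_targets and log_odds_bits are hypothetical and chosen only for illustration.

```python
import numpy as np

def adjust_targets(t0, p_new, q_new, tol=1e-12, max_iter=10_000):
    """Rescale rows and columns of t0 until its marginals match p_new and q_new.

    Iterative proportional fitting: a stand-in for the Newtonian
    procedure of the abstract (assumption, not the authors' code).
    """
    t = np.array(t0, dtype=float)
    for _ in range(max_iter):
        t *= (p_new / t.sum(axis=1))[:, None]   # match row sums
        t *= (q_new / t.sum(axis=0))[None, :]   # match column sums
        if np.abs(t.sum(axis=1) - p_new).max() < tol:
            break
    return t

def log_odds_bits(t, p, q):
    """Log-odds scores in bits: s_ij = log2(t_ij / (p_i * q_j))."""
    return np.log2(t / np.outer(p, q))

# Hypothetical "standard" target frequencies for a 3-letter alphabet.
t0 = np.array([[0.20, 0.03, 0.02],
               [0.03, 0.30, 0.02],
               [0.02, 0.02, 0.36]])
p0 = t0.sum(axis=1)                 # standard background implied by t0
s0 = log_odds_bits(t0, p0, p0)      # standard log-odds matrix

# Hypothetical biased composition shared by both sequences.
p_new = np.array([0.5, 0.3, 0.2])
t_new = adjust_targets(t0, p_new, p_new)
s_new = log_odds_bits(t_new, p_new, p_new)  # compositionally adjusted scores

print(np.round(s0, 2))
print(np.round(s_new, 2))
```

In this sketch the adjusted matrix s_new is consistent with the biased background by construction, because its target frequencies sum to p_new along both rows and columns, whereas s0 is consistent only with p0.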
