Statistics Of Sequence Comparison
National Library Of Medicine
Abstract
This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A related, subsidiary question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year has focused on two areas. First, we have continued the development and implementation of a method to transform an amino acid substitution matrix for use in comparing proteins that have non-standard amino acid compositions. We have described in detail and implemented a numerical procedure for accomplishing this transformation. The transformation may leave the matrix's relative entropy unconstrained, or it may constrain it to equal a specified value. We have investigated experimentally which approach is best for producing matrices that are sensitive in general-purpose database searches. We have found that constraining matrices to have relative entropy near 0.44 nats on average yields the best results. Second, we have studied "standard" and "composition based" statistics for the translated nucleic-acid database search program tblastn. In general, statistics based on standard amino acid compositions yield extremely unreliable E-values, which frequently report chance alignments as highly statistically significant. By estimating the "composition" of a database sequence from a window surrounding any given alignment, we are able to scale the substitution matrix so that it yields accurate "composition based" E-values. This procedure has now been implemented in NCBI's version of tblastn.
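The quantities mentioned above can be illustrated with a minimal sketch. It rests on the standard Karlin-Altschul relations: for a score matrix s and compositions p and p', the scale parameter lambda is the positive solution of sum_ij p_i p'_j exp(lambda s_ij) = 1, and the relative entropy in nats is lambda times the expected score under the implied target frequencies. The function names (`solve_lambda`, `relative_entropy`, `rescale_for_composition`) and the two-letter toy alphabet are hypothetical, chosen for illustration only. This is not NCBI's tblastn code, it omits the rounding of scores to integers, and it does not implement the constrained-entropy matrix transformation described in the abstract.

```python
import math


def _weighted_pairs(scores, p_query, p_subject):
    """List (p_i * p'_j, s_ij) for every letter pair with nonzero probability."""
    return [(p_query[i] * p_subject[j], scores[i][j])
            for i in range(len(p_query))
            for j in range(len(p_subject))
            if p_query[i] * p_subject[j] > 0]


def solve_lambda(scores, p_query, p_subject):
    """Positive root of sum_ij p_i p'_j exp(lambda * s_ij) = 1, by bisection."""
    pairs = _weighted_pairs(scores, p_query, p_subject)
    if sum(w * s for w, s in pairs) >= 0 or max(s for _, s in pairs) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    # f(0) = 0, f < 0 just above 0, and f > 0 beyond the root (f is convex).
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:
        lo /= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def relative_entropy(scores, p_query, p_subject, lam):
    """Relative entropy in nats: lambda * sum_ij q_ij s_ij,
    with implied target frequencies q_ij = p_i p'_j exp(lambda * s_ij)."""
    pairs = _weighted_pairs(scores, p_query, p_subject)
    return lam * sum(w * math.exp(lam * s) * s for w, s in pairs)


def rescale_for_composition(scores, lam_standard, p_query, p_window):
    """Scale scores so that, under the window composition, the matrix has
    the same lambda it had under the standard composition (an assumption
    about how 'composition based' scaling can be sketched, not NCBI's code)."""
    ratio = solve_lambda(scores, p_query, p_window) / lam_standard
    return [[ratio * s for s in row] for row in scores]


# Toy two-letter example: build scores from known target frequencies.
standard = [0.5, 0.5]
target = [[0.4, 0.1], [0.1, 0.4]]
true_lambda = 0.5
toy_scores = [[math.log(target[i][j] / (standard[i] * standard[j])) / true_lambda
               for j in range(2)] for i in range(2)]

lam_std = solve_lambda(toy_scores, standard, standard)
entropy = relative_entropy(toy_scores, standard, standard, lam_std)

# A biased "window" composition changes lambda; rescaling restores it.
window = [0.7, 0.3]
scaled = rescale_for_composition(toy_scores, lam_std, window, window)
lam_after = solve_lambda(scaled, window, window)
```

In this toy case `lam_std` recovers the lambda used to build the scores, and `lam_after` matches `lam_std`, so E-values computed with the standard lambda remain consistent for the rescaled matrix under the biased composition.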