Improvements and Extensions to the BLAST Algorithms

$231,056 · Z01 · FY2008 · LM · NIH

National Library Of Medicine

Abstract

Work this year has focused on an analysis of the application of the minimum description length (MDL) principle to the specification of pseudocounts for use with PSI-BLAST. In brief, PSI-BLAST estimates the probabilities of amino acids occurring in an alignment position by combining observed amino acid counts with data-dependent pseudocounts. The number of pseudocounts has in the past been chosen empirically. We have used the MDL principle to derive from theory the number of pseudocounts that should be employed. The theory suggests that for a realistic number n of sequences in a multiple alignment, the number of pseudocounts should be practically independent of n. However, the theory also suggests that the number of pseudocounts should depend on the composition of alignment columns, with more highly constrained columns receiving fewer pseudocounts. This has the effect of increasing the "contrast" in scores between protein profile columns. We have implemented a new procedure for specifying pseudocounts, and found that it improves PSI-BLAST retrieval accuracy to a statistically significant extent. This new procedure is now used by default by PSI-BLAST, and a manuscript describing the work is in preparation.
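To make the role of the pseudocount number concrete, the following is a minimal sketch of how observed frequencies and data-dependent pseudocounts can be blended for a single alignment column, and how using fewer pseudocounts sharpens the resulting scores. It follows the general form of the published PSI-BLAST estimate, Q_i = (alpha * f_i + beta * g_i) / (alpha + beta). It does not reproduce the MDL-derived rule, which the abstract does not state. The three-letter alphabet, the background and target frequencies, the chosen alpha and beta values, and all function names are illustrative assumptions, not PSI-BLAST's actual data or code.

```python
import math

def column_probabilities(observed, background, target, alpha, beta):
    """Blend observed frequencies with data-dependent pseudocount frequencies.

    observed   : observed residue frequencies f_j in the column
    background : background residue frequencies p_j
    target     : symmetric joint target frequencies q_ij (rows sum to p_i)
    alpha      : weight given to the observed frequencies
    beta       : number of pseudocounts (weight given to g_i)
    """
    n = len(observed)
    # Pseudocount frequencies: g_i = sum_j (f_j / p_j) * q_ij
    pseudo = [
        sum(observed[j] / background[j] * target[i][j] for j in range(n))
        for i in range(n)
    ]
    return [
        (alpha * observed[i] + beta * pseudo[i]) / (alpha + beta)
        for i in range(n)
    ]

def log_odds_scores(probs, background):
    """Column scores in bits: log2(Q_i / p_i)."""
    return [math.log2(q / p) for q, p in zip(probs, background)]

# Toy three-letter alphabet (hypothetical values, chosen only so that
# each row of TARGET sums to the matching BACKGROUND entry).
BACKGROUND = [0.5, 0.3, 0.2]
TARGET = [
    [0.40, 0.06, 0.04],
    [0.06, 0.20, 0.04],
    [0.04, 0.04, 0.12],
]
OBSERVED = [0.9, 0.1, 0.0]  # a highly constrained column

# Same column, many versus few pseudocounts (alpha held fixed).
many = column_probabilities(OBSERVED, BACKGROUND, TARGET, alpha=9, beta=10)
few = column_probabilities(OBSERVED, BACKGROUND, TARGET, alpha=9, beta=2)

scores_many = log_odds_scores(many, BACKGROUND)
scores_few = log_odds_scores(few, BACKGROUND)

# "Contrast" here is taken to mean the spread between the best and worst score.
contrast_many = max(scores_many) - min(scores_many)
contrast_few = max(scores_few) - min(scores_few)

if __name__ == "__main__":
    print("many pseudocounts:", [round(s, 2) for s in scores_many])
    print("few pseudocounts: ", [round(s, 2) for s in scores_few])
```

In this toy setting, lowering beta for the constrained column moves the estimate closer to the observed frequencies, so the conserved residue scores higher and the unobserved residue scores lower. That is one way to read the abstract's statement that fewer pseudocounts increase the contrast between profile columns.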
