Improvements And Extensions To The BLAST Algorithms
National Library Of Medicine
Abstract
The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 70,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Efforts this year have focused on improving the PSI-BLAST program. PSI-BLAST searches a database of protein sequences using a position-specific score matrix (PSSM) as the query. The PSSMs used are generally constructed on the fly, through multiple iterations of database searching, initiated with a standard protein sequence. PSI-BLAST has been widely used to annotate proteins inferred from new DNA sequences, and to generate sets of PSSMs representing large classes of proteins. To improve the sensitivity of PSI-BLAST to distant sequence relationships, we developed in previous years a system to evaluate the program's performance. For a set of about 100 query sequences, experts in the group compiled an exhaustive list of related proteins in yeast. Each query can then be compared to a comprehensive protein sequence database through an arbitrary number of PSI-BLAST iterations, and the resulting PSSM compared to the complete set of yeast sequences. This procedure generates a list of yeast sequences ordered by E-value, from which a plot of false positives vs. true positives may be obtained. We continued to use our evaluation system to test potential improvements to PSI-BLAST in detecting distant relationships, and to compare PSI-BLAST to other related programs.
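As a rough illustration of the evaluation step described above, the following Python sketch turns a list of hits ordered by E-value into cumulative (false positive, true positive) counts that could be plotted. The function name `fp_tp_curve`, the sequence identifiers, and the E-values are all hypothetical; the project's actual evaluation code is not described in this abstract.

```python
def fp_tp_curve(hits, true_relatives):
    """Return cumulative (false_positives, true_positives) counts.

    hits: list of (sequence_id, e_value) pairs from one search (made-up data here).
    true_relatives: set of sequence ids curated as truly related to the query.
    """
    # Rank hits from most to least significant (smallest E-value first).
    ranked = sorted(hits, key=lambda hit: hit[1])
    points = []
    tp = fp = 0
    for seq_id, _e_value in ranked:
        if seq_id in true_relatives:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points


# Hypothetical example: four yeast hits, three of which are curated true relatives.
hits = [("YAL001", 1e-30), ("YBR002", 0.5), ("YCL003", 1e-5), ("YDR004", 2.0)]
known = {"YAL001", "YCL003", "YDR004"}
curve = fp_tp_curve(hits, known)
```

Each point in `curve` corresponds to one E-value cutoff; a method that places more true relatives ahead of the first false positives yields a curve that rises faster.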
Several avenues were pursued this year: 1) We tested the relative sensitivity of the BLOSUM and OPTIMA scoring systems, and found BLOSUM to be superior; 2) We investigated which parameters relating to the heuristic nature of the BLAST algorithm had the most bearing on PSI-BLAST accuracy, as well as the tradeoff between speed and sensitivity implicit in adjusting these parameters; 3) We compared PSI-BLAST to the related program SAM, and found PSI-BLAST to be both faster and much more accurate in detecting distant relationships; 4) We investigated further the effects of "window-based" composition calculations, and determined that a larger test set will be required to study this procedure; 5) We began implementation of a "hybrid" local alignment scoring system that should permit the introduction of position-specific gap costs. On a separate front, we tested the relative sensitivity of using contiguous and non-contiguous "hits" for BLAST in both the DNA and protein contexts. Non-contiguous hits appear to yield no advantage for protein BLAST searches, but may provide a substantial improvement for DNA BLAST searches.
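The difference between contiguous and non-contiguous hits can be sketched in Python as follows. This is only a toy illustration, assuming a hit is defined by a binary mask in which "1" positions must match and "0" positions are ignored; the masks, the sequences, and the function name `seed_hits` are hypothetical and do not reflect the patterns or code actually tested in BLAST.

```python
from collections import defaultdict


def seed_hits(query, subject, mask):
    """Return (query_pos, subject_pos) pairs that agree at every '1' in mask.

    A mask of all '1's gives contiguous word hits; a mask containing '0's
    gives non-contiguous hits that tolerate differences at those positions.
    """
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)

    # Index every window of the subject by its letters at the kept positions.
    index = defaultdict(list)
    for j in range(len(subject) - span + 1):
        key = "".join(subject[j + i] for i in keep)
        index[key].append(j)

    # Look up each query window in the index.
    found = []
    for q in range(len(query) - span + 1):
        key = "".join(query[q + i] for i in keep)
        for j in index.get(key, []):
            found.append((q, j))
    return found


# Hypothetical DNA fragments differing by one substitution at position 3.
query = "ACGTACGT"
subject = "ACGAACGT"
contiguous = seed_hits(query, subject, "1111")  # four consecutive matches required
spaced = seed_hits(query, subject, "11011")     # four matches, one ignored position
```

Both masks require four matching letters, but the spaced mask can produce a hit that straddles the substitution, which a contiguous word of the same weight cannot do.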