Computationally efficient estimation of the error rates of hidden Markov model results

$300,000FY2009CSENSF

Health Research Incorporated/New York State Department Of Health, Menands NY

Investigators

Abstract

NSF proposal: 0914739 PI: Newberg, Lee A. Computationally efficient estimation of the error rates of hidden Markov model results Hidden Markov models are employed in a wide variety of fields, including speech recognition, econometrics, computer vision, signal processing, cryptanalysis, and computational biology. In speech recognition, hidden Markov models can be used to distinguish one word from another based upon the time series of certain qualities of a sound. In finance, the models can be used to simulate the unknown transitions between low, medium, and high debt default regimes in time. In computer vision they can be used to decode American Sign Language (ASL). Hidden Markov models are used in computational biology to find similarity between sequences of nucleotides (DNA or RNA) or polypeptides (proteins) and to predict protein structure. Hidden Markov models are employed because they permit the facile description and implementation of powerful statistical models and algorithms that are used to score a match possibility in sequence data. Perhaps the most common use of hidden Markov models is for the purpose of hypothesis testing or classification. For instance, a speech-recognition model may be used to quantify the belief that a recorded message contains the word ?elephant.? However, once a score for a belief has been computed, the question is how to interpret that value. 1. Is the score strong enough to indicate a signal, or is it reasonably probable that noise will yield a score this strong? 2. Is the score weak enough to indicate noise, or is it reasonably probable that a signal will yield a score this weak? The false positive rate (closely related to the type I error or p-value) for a score threshold is the probability that noise data will yield a score at least as strong as the threshold. The false negative rate for a score threshold is the probability that signal data will fail to score at least as strong as the threshold. In 2008, Newberg designed a method for estimating error rates that is more efficient than other approaches that are applicable to general hidden Markov models. However the approach is still too slow for computationally intensive applications such as repeated searches of large DNA databases. This proposed research aims to speed the estimation primarily via two approaches: (1) the creative re-use of simulations, and (2) statistically robust elimination of outlier simulation results. The proposed research is significant because the facile availability of error rates permits researchers in a wide variety of scientific fields to evaluate the statistical significance of their conclusions and the power of their hypothesis tests. Once the technique and software are available, researchers in speech recognition or ASL recognition will be able to use rigorously derived statistical significance values to set their hypothesis test thresholds for word recognition. Financial modelers will have a rigorous standard by which to evaluate their market timing models. Computational biologists will have rigorous statistical significance values for their sequence alignments and for their pattern scans of large sequence databases. More generally, the availability of error rates for hidden Markov model results will significantly enhance the attractiveness of hidden Markov models for use in fields where hidden Markov models are not currently employed.

View original record on NSF Award Search →