AMC-SS: Markovian Embeddings for the Analysis and Computation of Patterns in non-Markovian Random Sequences

$180,000FY2008MPSNSF

University Of Colorado At Boulder, Boulder CO

Investigators

Abstract

The PI aims to develop new tools for the systematic analysis of patterns in random sequences with an arbitrary correlation structure. The project considers theoretical and computational aspects associated with possibly non-regular patterns in non-Markovian sequences. Previous work by the PI has shown the existence and uniqueness of optimal Markovian structures to keep track of the number of matches with a given pattern in a random sequence. At a theoretical level, the project aims to identify ergodic Markovian structures in embeddings that result in non-ergodic behavior due to small perturbations, and to use these ergodic structures to characterize the asymptotic distribution of the number of matches with a pattern in a non-Markovian sequence. It also compares the entropy of the optimal Markovian embedding of a non-Markovian sequence with that of the original sequence. At a computational level, the PI aims to apply generating function techniques in conjunction with Markovian embeddings to characterize the joint asymptotic distribution of the number of matches with several patterns. The PI also aims to quantify the error in approximating the number of matches with a pattern in a sequence of a moderate size with that of a Binomial distribution. A secondary goal of the project is to develop a method for the symbolic specification of branching processes that will represent the solutions of non-linear o.d.e.'s probabilistically and to explore new connections between these and the analysis of patterns in non-Markovian sequences. The analysis of patterns in random sequences lies at the core of several emerging fields such as computational biology, security systems, speech recognition and text mining. Random sequences of characters are used to model diverse phenomena ranging from written text to genomic sequences to audit files. In these, exceptional patterns, i.e., over- or under-represented words, may provide a great deal of insight or knowledge. For instance, a widely used heuristic is that overrepresented patterns in DNA may be key for gene expression whereas underrepresented patterns may interfere with this process. As another example, hackers leave traces of their intrusions into secured databases in audit files and unusual patterns may be used to warn of potential security breaches. However, one cannot assess how truly exceptional a pattern is without a bona fide statistical model of the text in which it is immersed. The prevalent models cannot accommodate the long-range correlations present in most types of text. For example, the characters occurring in written text or audit files follow syntax rules and, in RNA sequences, palindromic structures induced by base-pairing convey genome-wide correlations. Unfortunately, the available techniques to assess how exceptional a pattern is, cannot systematically handle the more realistic models. Furthermore, the PI has recently shown that the widely used paradigm that the number of matches with a pattern in a long text is approximately Gaussian distributed does not necessarily apply when long-range correlations are present. Due to these considerations the PI aims to develop new qualitative and quantitative tools to address systematically the occurrence of highly complex patterns under more realistic statistical models of text.

View original record on NSF Award Search →