Position Sensitive P-Mer Frequency Clustering with Applications to Classification

$204,974R21FY2012HGNIH

Southern Methodist University, Dallas TX

Investigators

Margaret Holder Dunhamcontact Michael Hahsler Monnie McGee

Linked publications & trials

Abstract

DESCRIPTION (provided by applicant): Position Sensitive P-Mer Frequency Clustering with Applications to Classification and Differentiation Recent genomic sequencing advances, such as next generation sequencing, and projects like the Human Microbiome Project create extremely large genomic databases. Even though the length of any specific sequence may be much shorter than that of the complete DNA sequence of an organism, looking at enormous libraries of sequences, such as 16S rRNA, presents an equally (if not greater) computational challenge. In traditional genomic analysis, only one sequence may be analyzed at a time. When dealing with metagenomics, thousands (or more) sequences need to be analyzed at the same time. However, to study such problems as environmental biological diversity and human microbiome diversity this is exactly what is needed. Current techniques have several shortcomings which need to be addressed. Techniques involving sequence alignment are typically based on selection of one representative sequence (as is typically done when looking at 16S rRNA data) which introduces selection bias. Genomic databases involving multiple copies of 16S per organism across thousands of organisms, will soon grow too large to practically process just using computationally expensive alignment methods to match sequences, but faster alignment-free methods currently do not provide the needed accuracy and sensitivity. As a complement to existing methods we introduce a novel class of fast high-throughput algorithms based on quasi-alignment using position specific p-mer frequency clustering. Organisms are represented by a directed graph structure that summarizes the ordering between clusters of p-mer frequency histograms at different positions in sequences. This model can be learned using all available 16S copies of an organism and thus eliminates selection bias. Due to the added position information, these algorithms can be used for species (and even strain) classification facilitating the study of strain diversity within species. Our prototype implementation of this new technique shows that it is able to produce compact profiles which can be efficiently stored and used for large scale classification and differentiation down to the strain level. Since the technique incorporates high-throughput data stream clustering, a proven technique in high performance computing, it scales well for very large scale DNA/RNA sequence data as well as massive sets of short sequence snippets collected during metagenomic research. In this project we will develop a suite of tools, profile models, and scoring techniques to model RNA/DNA sequences providing applications of organism classification, and intra/inter-organism similarity/diversity. Our approach provides both the specificity needed to perform strain classification and still avoid the computational overhead of alignment. It is important to note that this is accomplished through dynamic online machine learning techniques without human intervention.

View original record on NIH RePORTER →