AF: Medium: Collaborative Research: Sequential and Parallel Algorithms for Approximate Sequence Matching with Applications to Computational Biology

$525,000FY2017CSENSF

Georgia Tech Research Corporation, Atlanta GA

Investigators

Abstract

Sequence matching problems are central to the field of genomics, both in analyzing naturally occurring sequences such as genomes and in analyzing data from sequencing instruments. Often, methods that can accommodate a small number of differences within the matching regions suffice in practice. Such methods, described as alignment-free or approximate sequence matching methods, have typically relied on heuristics. This project work is advancing the field by creating a mathematical framework and solving multiple approximate sequence matching problems with provably efficient run-time guarantees. Project work is also supporting the development of practical heuristics inspired and supported by the mathematical framework, development of parallel methods for solving large-scale problems on high performance parallel computers, and studying the impact of these methods on important applications. Project results are made available through open source software for use by practitioners. Results from this research will be incorporated into courses taught by the PIs, and disseminated more broadly through book chapters and tutorials and accompanying slides. The project will support research scientist and Ph.D. students in interdisciplinary training for launching them into productive careers focused on important problems of current relevance. Undergraduate participation is planned through course projects. Project work builds upon recent progress in alignment-free genome comparison methods, and exploits the controlled error characteristics of data generated by high-throughput sequencers, and the many bioinformatics applications enabled by them. Project objectives include developing a robust algorithmic framework for designing newer alignment-free methods based on approximate substring composition, and developing sequential and parallel algorithms for pairwise approximate sequence matching among large sequence data sets. The goal is to develop algorithms that are asymptotically superior to quadratic alignment-based approaches, and achieve good practical performance either directly or through further development of practical heuristic that rely on the underlying theory. The developed techniques will be further investigated in the context of important applications such as read error correction, genome mapping, and assembly. Though conducted in the context of computational biology, some of the methods are potentially applicable to other areas such as text processing and information retrieval. Broader research community will be impacted through release of software modules and project work in important application areas.

View original record on NSF Award Search →