
CRII: CIF: Model-based Compression of Biological Sequences

$175,000 · FY2018 · CSE · NSF

University of Virginia Main Campus, Charlottesville, VA

Investigators

Abstract

With the increasingly widespread use of high-throughput genome sequencing, the amount of biological sequence data is growing at a rate much faster than the decrease in the cost of storage media. To avoid saturating available storage capacity, such data must be compressed at a high ratio. Biological sequences are created over the course of evolution by mutation processes, including substitution, insertion, deletion, and duplication. While these processes shape the statistical properties of genomic sequences and play a critical role in determining which compression approaches will provide improved performance, they are not taken into account by current methods. The goal of this project is to provide a principled approach to biological data compression by developing and leveraging mutation models that approximate the generation process of genomic sequences. The main research thrusts of the project are: 1) determining the fundamental limits of the compressibility of biological sequences; and 2) developing and evaluating encoding and decoding algorithms that approach these limits. Identifying the limits of compression relies on developing combinatorial and stochastic string-editing models that represent sequence generation through genomic mutations. These models are then studied from an information-theoretic point of view to determine their combinatorial and stochastic capacities, thus providing bounds on the compressibility of genomic sequences. The second thrust leverages the statistical properties arising from mutation models, such as repeat structures, to develop efficient compression tools. In addition to improving compression methods, the success of these research directions will enhance our understanding of complex sequence generation processes, enable the generation of faithful synthetic data, and facilitate the quantitative study of the role of mutations in generating novel biological functions. 
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
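
As an illustration only, the kind of stochastic string-editing model the abstract describes can be sketched as a process that repeatedly applies substitution, insertion, deletion, and duplication to a seed sequence. The sketch below is not the project's model: the function name `mutate`, the uniform choice among the four operations, the tandem placement of duplicated segments, and the `max_dup` parameter are all assumptions made for the example.

```python
import random

ALPHABET = "ACGT"  # DNA alphabet; assumed for this illustration


def mutate(seq, steps, rng, max_dup=4):
    """Apply `steps` random edits to `seq` and return the edited string.

    Each step picks one of four mutation types uniformly at random
    (an assumption; real mutation rates are far from uniform).
    """
    s = list(seq)
    for _ in range(steps):
        op = rng.choice(["substitution", "insertion", "deletion", "duplication"])
        if op == "substitution" and s:
            # Replace one symbol (may pick the same symbol again).
            i = rng.randrange(len(s))
            s[i] = rng.choice(ALPHABET)
        elif op == "insertion":
            # Insert one random symbol at a random position.
            i = rng.randrange(len(s) + 1)
            s.insert(i, rng.choice(ALPHABET))
        elif op == "deletion" and len(s) > 1:
            # Delete one symbol, never emptying the sequence.
            del s[rng.randrange(len(s))]
        elif op == "duplication" and s:
            # Copy a short segment and place the copy right after it
            # (a tandem duplication), which creates repeat structure.
            i = rng.randrange(len(s))
            k = rng.randint(1, max_dup)
            segment = s[i:i + k]
            end = i + len(segment)
            s[end:end] = segment
    return "".join(s)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(mutate("ACGT", 50, rng))
```

Sequences generated this way tend to contain repeats because of the duplication step, which is the kind of statistical structure the second research thrust proposes to exploit for compression.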
