Novel compression strategies for efficient storage, retrieval and analysis of sequence data

$308,794 | R01 | FY2025 | GM | NIH

Temple Univ Of The Commonwealth, Philadelphia PA

Investigators

Abstract

A post-genomic era is upon us, marked by the rapid availability of massive sequence data. Storing, retrieving, and analyzing such high volumes of raw sequence data, as well as alignment information, pose significant challenges to researchers in the field. Consequently, new methods are desirable to support efficient storage and retrieval of such data to meet the demand of the exponential growth in sequence data. Within this scope, we will develop a suite of novel computational methods for compressing sequence data. Our proposed research includes three specific aims to develop: (1) an error-bounded lossy compressor for quality scores in sequence files, (2) an adaptive compression strategy to achieve optimal performance, and (3) a high-throughput compression framework for efficient sequence compression. We will extensively evaluate the proposed approaches on real and synthetic datasets, and develop modularized software tools to facilitate the application of the proposed methods. With a well-defined research plan for developing innovative methodologies to compress sequence data, this project will establish a new paradigm of data compression to support efficient storage, retrieval, transfer, and analysis of the ever-growing DNA and RNA sequencing data.
