Personal and panel references for improved alignment

$2,349,722 | R01 | FY2025 | HG | NIH

Johns Hopkins University, Baltimore MD

Abstract

PROJECT SUMMARY Next-generation sequencing is ubiquitous in the study of biology and disease. Analyzing a sequencing dataset involves taking DNA fragments and determining where each originated with respect to a reference genome, or with respect to a large collection of sequences like a pangenomic or taxonomic database. These databases are growing rapidly, thanks to the increasing availability of high-quality genome assemblies. But this creates a tension: on the one hand, our analyses should take full advantage of these sequences to maximize sensitivity and avoid reference bias. On the other, our standard algorithms and methods do not scale well to larger collections, necessitating larger computers and more time and effort. We propose a four-aim project to greatly improve the scalability of our day-to-day sequencing data analysis tools. We build on the successful foundation of compressed full-text indexes: data structures that facilitate efficient classification and alignment of sequences, but whose size grows with the amount of distinct sequence in the collection, i.e., with the collection's compressed size rather than its raw size. A successful project will enable a future where (a) sequencing-based diagnostics and therapeutics are free from reference bias, (b) genomics algorithms run in time and space proportional to the data's compressed size rather than its raw size, and (c) continued sequence database growth poses no major hurdle to research, because our methods eliminate redundancy and are not limited to particular k-mer lengths. All software and training materials will be available under an open source license, per our team's track record.
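The claim that index size can track distinct sequence rather than raw size can be illustrated with a toy example. The sketch below assumes a run-length-encoded Burrows-Wheeler transform (BWT) as the underlying idea, which is one well-known basis for compressed full-text indexes; the abstract does not state which structures the project uses, so this is an illustration only. The function names (`bwt`, `count_runs`, `mutate`) and the simulated "genomes" are hypothetical and are not taken from the project's software.

```python
import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Quadratic in time and memory, so suitable for toy inputs only.
    Assumes `text` does not contain the sentinel '$'.
    """
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def mutate(seq: str, n_subs: int, rng: random.Random) -> str:
    """Copy of seq with n_subs random single-base substitutions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_subs):
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)


rng = random.Random(0)  # fixed seed so the run is repeatable

# One simulated "genome", and a collection of ten near-identical copies.
base = "".join(rng.choice("ACGT") for _ in range(300))
one = base
many = "".join(mutate(base, 2, rng) for _ in range(10))

runs_one = count_runs(bwt(one))
runs_many = count_runs(bwt(many))

print("raw length:", len(one), "->", len(many))
print("BWT runs:  ", runs_one, "->", runs_many)
```

In this sketch the raw length grows tenfold while the number of BWT runs grows far more slowly, because near-identical copies place the same preceding characters next to each other in the transform. A run-length-based index stores roughly one entry per run, which is the sense in which its size follows the compressed size of the collection.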
