AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing

$414,034FY2023CSENSF

North Carolina State University, Raleigh NC

Investigators

Abstract

Being able to store, search, and analyze massive data sets efficiently is one of today's pressing challenges. To that end, this project will study a collection of problems under text compression and indexing with tremendous current relevance, owing to a specific characteristic prevalent in many modern text data sets, called high repetitiveness. This characteristic makes the data highly compressible using some specialized schemes. However, the theoretical understanding of those schemes is still in a nascent stage. This project will address some of the fundamental open problems on the effectiveness of several schemes that are popular in practice. This project will also introduce new ideas for indexing such data in a highly space-efficient manner and quickly support various (application-specific) queries. The techniques developed in this project will apply to a broad class of algorithmic problems and applications; therefore, they will have a lasting impact on related fields. The main results stem from this research will be disseminated through major conferences, journals, and workshop tutorials. The participation of undergraduates and minorities, including women, will be ensured. Suffix trees and suffix arrays are two fundamental data structures for text indexing with many applications in bioinformatics. However, they are notorious for their space complexity. The FM-index index encodes suffix arrays in space close to the entropy-based lower bound using the Burrows-Wheeler Transformation (BWT). But, the entropy model of compression is less effective in capturing repetitiveness. Therefore, modern applications urge even more space frugal encodings that exploit repetitiveness. Although the community has made some recent progress in exact pattern matching, solutions to many other (more complex) matching problems are still open. To that end, this project will attempt to design repetition-aware indexes for pattern matching under mismatches, edits, and wildcards, along with efficient construction algorithms. This project also aims to develop a unified framework to compress several advanced suffix-tree variants (quasi-suffix trees) that support even more complex matching paradigms such as parameterized and order-isomorphic matching. Novel techniques and concepts that go beyond the existing BWT-based methods will be sought. Additionally, the project will attempt a deeper study on various aspects of BWT-runs and the relative Lempel-Ziv compression scheme from the hardness side. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →