AF: DC: Collaborative Research: Pattern Matching for Massive Data Sets

$500,000FY2010CSENSF

Louisiana State University, Baton Rouge LA

Investigators

Abstract

Pattern matching is a fundamental research field with applications in domains such as biological sequence alignment, web search engines and network intrusion detection. Given a pattern P and a text string T, the central problem is to find occurrences of P in T. When data becomes massive, we cannot assume that text can be stored in RAM. Pattern matching problems must now be considered with more appropriate models like external memory model, cache-oblivious model, streaming models, MapReduce paradigm and multi-core models. In many cases, a blend of models or newer, more appropriate models need to be developed keeping the practical aspects of the application in sight. The focus of this project is to develop efficient search algorithms and indexes when a data set resides on disks, on network storage, or is accessible only as an online stream. The data must be efficiently searchable even though it may be in compressed format. The project considers traditional pattern matching problem, as well as variants such as (i) approximate matching -- where the pattern may not exactly match a substring in T, (ii) online matching -- where the pattern(s) are known in advance and text comes as a stream, and (iii) string retrieval -- where instead of finding all the occurrences, the focus is on retrieving high ranking documents which contain one or more occurences of the query pattern. Issues of I/O efficiency and space utilization are central to this project. This involves developing suitable massive data set models, deriving optimal theoretical bounds and implementing practical tools. Methodologies include combinatorial and randomized methods in pattern matching, succinct data structures, top-k query processing and I/O efficient indexes. The project will build new, solid theoretical foundations in pattern matching, with direct applications to fields like databases and information retrieval. It will significantly drive forward current state of the art in web search engine technology (by impacting the way inverted indexes are used) and genome sequence alignment tools (e.g., BLAST). Tools and software developed during this project will be widely distributed to the research community. Some components will be incorporated into undergraduate and graduate algorithms course curricula as implementation projects.

View original record on NSF Award Search →