III-COR-Small: Beyond Keyword Search: Enabling Diverse Structured Query Paradigms over Text Databases

$448,976 | FY2008 | CSE | NSF

Columbia University, New York NY

Investigators

Abstract

The text available on the Web and beyond embeds unprecedented volumes of valuable structured data, "hidden" in natural language. For example, a news article might discuss an outbreak of an infectious disease, reporting the name of the disease, the number of people affected, and the geographical regions involved. Keyword search, the prevalent query paradigm for text, is often insufficiently expressive for complex information needs that require structured data embedded in text. For such needs, users (e.g., an epidemiologist compiling statistics, as reported in the media, on recent foodborne disease outbreaks in a remote country) are forced to embark on labor-intensive cycles of keyword-based document retrieval and manual document filtering until they locate the appropriate (structured) information. To move beyond keyword search, this project exploits information extraction technology, which identifies structured data in text, to enable structured querying. To capture diverse user information needs and depart from a "one-size-fits-all" querying approach, which is inappropriate for this extraction-based scenario, this project explores a wealth of structured query paradigms. Sometimes users (e.g., a high-school student in need of some quick examples and statistics for a report on recent salmonella outbreaks in developing countries) are after a few exploratory results, which should be returned fast; at other times, users (e.g., the above epidemiologist investigating foodborne diseases) are after comprehensive results, for which waiting a longer time is acceptable. The project develops specialized cost-based query optimizers for each query paradigm, accounting for the efficiency and, critically, the result quality of the query execution plans.
The technology produced will assist a vast range of users and information needs, by enabling efficient, diverse interactions with text databases -- for sophisticated searching and data mining -- that are cumbersome or impossible with today's technology. The research and educational components of the project will rely on -- and encourage -- a tight integration of three complementary Computer Science disciplines, namely, natural language processing, information retrieval, and databases. The project will also provide data sets and source code, for experimentation and evaluation, to the community at large over the Web (http://extraction.cs.columbia.edu/).
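
As a rough illustration of the paradigm-specific, cost-based plan selection described in the abstract, the following minimal sketch picks an execution plan by weighing estimated running time against estimated result quality. It is not the project's optimizer: the `Plan` fields, the plan names, the thresholds, and all numeric estimates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with (hypothetical) optimizer estimates."""
    name: str
    est_seconds: float    # estimated execution time
    est_recall: float     # estimated fraction of extractable tuples returned (0..1)
    est_precision: float  # estimated fraction of returned tuples that are correct (0..1)


def choose_plan(
    plans: Sequence[Plan],
    paradigm: str,
    min_precision: float = 0.8,
    min_recall: float = 0.9,
) -> Optional[Plan]:
    """Return the cheapest plan that meets the paradigm's quality requirements.

    "exploratory":   a few correct results, fast -> only precision is constrained.
    "comprehensive": near-complete results       -> precision and recall are constrained.
    Returns None when no plan satisfies the constraints.
    """
    if paradigm not in ("exploratory", "comprehensive"):
        raise ValueError(f"unknown paradigm: {paradigm!r}")

    # Both paradigms require results of acceptable precision.
    candidates = [p for p in plans if p.est_precision >= min_precision]

    # Only the comprehensive paradigm additionally requires high recall.
    if paradigm == "comprehensive":
        candidates = [p for p in candidates if p.est_recall >= min_recall]

    if not candidates:
        return None
    # Among the acceptable plans, pick the one with the lowest estimated time.
    return min(candidates, key=lambda p: p.est_seconds)


# Made-up estimates for three hypothetical plans over the same text database.
PLANS = [
    Plan("keyword_probe_then_extract", est_seconds=30.0, est_recall=0.35, est_precision=0.90),
    Plan("classifier_filter_then_extract", est_seconds=600.0, est_recall=0.70, est_precision=0.88),
    Plan("full_scan_extract", est_seconds=3600.0, est_recall=0.97, est_precision=0.85),
]

if __name__ == "__main__":
    print(choose_plan(PLANS, "exploratory").name)
    print(choose_plan(PLANS, "comprehensive").name)
```

Under these made-up estimates, the student's exploratory query would be served by the fast keyword-driven plan, while the epidemiologist's comprehensive query would fall through to the slow full scan, the only plan whose estimated recall clears the threshold.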
