III: Small: Towards Agile Information Integration for Large Scale-- Data Aware Indexing and Search over Unstructured Data
University of Illinois at Urbana-Champaign, Urbana, IL
Investigators
Abstract
This proposal aims to enable scalable and adaptable information integration over unstructured data at large scale. Users need to run structured queries over unstructured data, for example executing SQL queries over Web pages. This project will develop a new approach, "query push-down," distinct from conventional "data pull-up" techniques, as a promising direction for achieving agility in integration. The technical objectives will be driven by two application domains: Army land planning and the Illinois digital library. The team will develop query translation techniques that "push down" queries into a form that can be executed over unstructured document and feature indexes. This approach will eliminate expensive, inflexible, and often fragile extraction from unstructured data, enabling scalable and adaptable information integration through "best effort" semantics. In the query push-down approach, queries are no longer executed under SQL-like Boolean semantics but instead take a maximum-likelihood interpretation: given the uncertainty and imprecision in the data, what are the most likely answers to a properly translated query? The team will study the formalism that governs such probabilistic query execution, achieving "best effort" with probabilities as a formal quality metric. Researchers will build the Data-oriented Content Query System, which will let users of Web data query with not only keywords but also data types, retrieving relevant values of their desired data from the contents of the corpus by specifying flexible patterns and customizing scoring functions. Structured queries will be translated for execution in the system to access and integrate the unstructured contents of the corpus. Successful results from the proposed research will have significant impact in two areas.
First, the research community has observed the scalability limitations of current integration schemes. These observations highlight the urgency of the proposed study to develop large-scale, agile integration techniques. The research will formally advance the understanding of large-scale best-effort integration and develop a set of general techniques. Second, the development of the query system engine will provide access to the data-rich Web, with practical deployment at the Illinois Gateway of the UIUC digital library, which will improve student and faculty access to online scholarly and open information. Students will be directly involved in the research effort, and new curricula are planned.
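To make the query push-down idea in the abstract more concrete, the following is a minimal illustrative sketch, not the project's actual system. It assumes a toy model in which a query is a set of keywords plus a desired data type, data types are recognized by regular expressions, and a candidate value is scored by its proximity to the keywords in the raw text. The names `TYPE_PATTERNS` and `push_down`, the `window` parameter, and the inverse-distance scoring function are all hypothetical choices made for illustration. The normalized scores are only a stand-in for the probabilistic ranking the abstract describes.

```python
import re
from collections import defaultdict

# Hypothetical data-type recognizers (assumption: types are regex-detectable).
TYPE_PATTERNS = {
    "year": re.compile(r"\b(?:1[5-9]|20)\d{2}\b"),
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def push_down(keywords, data_type, docs, window=40):
    """Rank values of `data_type` that occur near `keywords` in raw text.

    Instead of extracting a table first ("data pull-up"), the query is
    evaluated directly against the text, and answers are ranked by score
    rather than filtered by a Boolean condition.
    """
    pattern = TYPE_PATTERNS[data_type]
    scores = defaultdict(float)
    for text in docs:
        lowered = text.lower()
        # Character offsets of every keyword occurrence in this document.
        kw_positions = [
            m.start()
            for kw in keywords
            for m in re.finditer(re.escape(kw.lower()), lowered)
        ]
        if not kw_positions:
            continue
        for m in pattern.finditer(text):
            dist = min(abs(m.start() - p) for p in kw_positions)
            if dist <= window:
                # Illustrative scoring: closer to a keyword means higher score.
                scores[m.group()] += 1.0 / (1.0 + dist)
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Normalize so the scores of all candidate answers sum to 1.
    return [(value, score / total) for value, score in ranked]


docs = [
    "The library was founded in 1867 and renovated in 1999.",
    "Founded in 1867, the campus library holds rare maps.",
    "Tickets cost $12.50 in 2010.",
]
ranked = push_down(["founded"], "year", docs)
for value, weight in ranked:
    print(value, round(weight, 3))
```

In this sketch the same value found in several documents accumulates score, so agreement across sources raises its rank. A real system would replace the regular expressions and the distance heuristic with indexed data-type recognition and a principled probability model.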