CRI: Global-scale Data Sharing using Statistics and Probabilities

$249,508 | FY2005 | CSE | NSF

University of Washington, Seattle, WA

Investigators

Abstract

Proposal: CNS 0454425
PI: Dan Suciu
Institution: University of Washington
Program: NSF 04-588 CISE Computing Research Infrastructure
Title: CRI: Global-scale Data Sharing using Statistics and Probabilities

This project will address the problem of semantic heterogeneity that arises in large-scale data integration by exploring how well novel techniques scale to very large amounts of data. Two such techniques will be considered. The first is corpus-based schema matching, in which a large collection (corpus) of schemas is stored, analyzed, and preprocessed in order to improve automatic schema matching. The second is probabilistic query answering, which efficiently evaluates complex SQL queries over probabilistic databases.

To study how these techniques scale to large data integration tasks, a significant fragment of the Web will be downloaded and stored locally on a cluster of servers. Data instances and their schemas will be extracted automatically from these Web pages. The resulting corpus of schemas will be matched using a variety of techniques, and the matches will be interpreted probabilistically. The resulting data organization is called the semantic cache. Users will be able to formulate rich queries over the semantic cache, for example in a language like SQL. Each query will be evaluated on the global data and given a probabilistic interpretation. The answers will be returned to the user ranked by their probabilities.

If successful, this project has the potential to impact a variety of applications in which large-scale data integration is currently impossible to achieve, such as scientific data sharing, electronic commerce, and emergency management systems.
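The idea of giving a query a probabilistic interpretation and ranking its answers can be illustrated with a small sketch. The code below is not the project's method: it assumes a toy model in which every extracted row is an independent event with a known probability, and it computes each answer's probability by brute-force enumeration of all possible worlds, which is only feasible for a handful of rows. The function name `answer_probabilities`, the row layout `(site, city, item)`, and all the probabilities are hypothetical and chosen for illustration.

```python
from itertools import product


def answer_probabilities(tuples, query):
    """Rank query answers by probability under tuple independence.

    tuples: list of (row, probability) pairs; each row is assumed to be
            present independently of the others (an assumption of this sketch).
    query:  function mapping a list of rows (one possible world) to a set
            of answers.
    Returns a list of (answer, probability), highest probability first.
    """
    probs = {}
    # Enumerate every possible world: each row is either present or absent.
    for included in product([False, True], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (row, p), keep in zip(tuples, included):
            weight *= p if keep else 1.0 - p
            if keep:
                world.append(row)
        # An answer's probability is the total weight of the worlds
        # in which the query returns it.
        for answer in query(world):
            probs[answer] = probs.get(answer, 0.0) + weight
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical extracted rows: (site, city, item), each with a confidence.
extracted = [
    (("siteA", "Seattle", "laptop"), 0.5),
    (("siteB", "Seattle", "laptop"), 0.5),
    (("siteC", "Portland", "laptop"), 0.6),
    (("siteD", "Portland", "phone"), 0.9),
]


def cities_selling_laptops(world):
    # Roughly: SELECT DISTINCT city FROM extracted WHERE item = 'laptop'
    return {city for (_site, city, item) in world if item == "laptop"}


ranked = answer_probabilities(extracted, cities_selling_laptops)
for city, p in ranked:
    print(f"{city}: {p:.2f}")
```

In this toy data, Seattle is supported by two independent rows of probability 0.5 each, so it appears in the answer with probability 1 - 0.5 * 0.5 = 0.75 and ranks ahead of Portland at 0.6. Enumeration is exponential in the number of rows, so a system at the scale described here would need far more efficient evaluation strategies.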
