III: Small: Managing Large-scale Uncertain Data Repositories

$498,538 | FY2009 | CSE | NSF

University Of Maryland, College Park, College Park MD

Investigators

Abstract

Increasing numbers of real-world application domains are generating data that is inherently noisy, incomplete, and probabilistic in nature. Examples of such data include measurement data collected by sensor networks, observation data in the context of social networks, scientific and biomedical data, and data collected by various online cyber-sources. The data uncertainties may be a result of the fundamental limitations of the underlying measurement infrastructures, the inherent ambiguity in the domain, or they may be a side effect of the rich probabilistic modeling typically performed to extract high-level events from sensor and cyber data. Similarly, when attempting to integrate heterogeneous data sources ("data integration") or to extract structured information from text ("information extraction"), the results are approximate and uncertain at best. However, there is currently a lack of data management tools that can reason about large volumes of uncertain data, and hence the information about the uncertainty is often either discarded or reasoned about only superficially.
In this project, we are building a complete probabilistic data management system, called PrDB, that can manage, store, and process large-scale repositories of uncertain data. PrDB unifies ideas from "large-scale structured graphical models" like probabilistic relational models (PRMs), developed in the machine learning literature, and "probabilistic query processing", studied in the database literature. The PrDB framework is based on the notion of "shared factors", which not only allows us to express and manipulate uncertainties at various levels of abstraction, but also supports capturing rich correlations among the uncertain data. PrDB supports a declarative SQL-like language for specifying uncertain data and the correlations among them. PrDB also supports exact and approximate evaluation of a wide range of queries, including inference queries, SQL queries, and decision-support queries.
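To make the idea of factors shared across correlated uncertain tuples concrete, the following is a minimal illustrative sketch in Python. It is not PrDB code and does not use PrDB's language or API; the tuple names (`t1`, `t2`, `t3`), the factor functions `prior` and `agree`, and their weights are all hypothetical. Each uncertain tuple gets a Boolean "exists" variable, the same two factor functions are reused across several variables, and a query probability is computed exactly by enumerating all possible worlds, which is only feasible for tiny examples.

```python
from itertools import product

# Hypothetical toy model: one Boolean "exists" variable per uncertain tuple.
variables = ["t1", "t2", "t3"]

def prior(x):
    # Unary factor (assumed weights): a tuple is somewhat more likely to exist.
    return 0.6 if x else 0.4

def agree(x, y):
    # Pairwise factor (assumed weights): favors worlds where both tuples agree.
    return 2.0 if x == y else 1.0

# Each entry pairs a factor function with the variables it applies to.
# The same function object is shared by several entries.
factors = [
    (prior, ("t1",)), (prior, ("t2",)), (prior, ("t3",)),
    (agree, ("t1", "t2")), (agree, ("t2", "t3")),
]

def weight(world):
    # Unnormalized weight of one possible world: the product of all factors.
    w = 1.0
    for f, scope in factors:
        w *= f(*(world[v] for v in scope))
    return w

def probability(event):
    # Exact inference by brute-force enumeration of all possible worlds.
    total = 0.0
    hit = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        w = weight(world)
        total += w
        if event(world):
            hit += w
    return hit / total

p_t1 = probability(lambda w: w["t1"])
p_t3 = probability(lambda w: w["t3"])
p_t1_and_t3 = probability(lambda w: w["t1"] and w["t3"])
```

In this toy model the joint probability `p_t1_and_t3` exceeds the product `p_t1 * p_t3`, because `t1` and `t3` are positively correlated through `t2`. A system that assumed tuple independence would miss this kind of correlation.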
The cross-disciplinary research undertaken during this project will enable us to simultaneously address the challenges in the areas of probabilistic databases and machine learning, and allow us to transfer the key technologies developed between those areas, thus advancing the research in both. It will enable the development of a significant and high-impact new class of real-world applications in a variety of domains, including health informatics, social media management, the World Wide Web, and scientific databases. The PrDB system source code and the datasets generated during the project will be released under an appropriate open source license at the project web site: http://www.cs.umd.edu/db/PrDB.html
