BIGDATA: F: Open-World Foundations for Big Uncertain Data

$432,202FY2016CSENSF

University Of California-Los Angeles, Los Angeles CA

Investigators

Abstract

Driven by the need to learn from vast amounts of text data, efforts throughout natural language processing, information extraction, databases, and AI are coming together to build large-scale knowledge bases. These systems continuously crawl the web to extract relational data from text, and have already populated their databases with millions of entities and billions of tuples. Large-scale probabilistic knowledge bases are revolutionizing the way we access data. They are now routinely used by scientists to build knowledge bases of publications, by law enforcement to extract information from the dark web, and by regular search engine users who find their results augmented with structured information. Such knowledge bases are inherently probabilistic: to go from raw text to structured data, a sequence of statistical machine learning techniques associate probabilities with database tuples. This project revisits the semantics underlying such systems, and provide a more adequate foundational framework. In particular, the closed-world assumption of probabilistic databases, that facts not in the database have probability zero, clearly conflicts with their everyday use, and obstructs the progress in this area. More specifically, this project develops a new semantic foundation based on the open-world assumption, that facts not in the database are possible, but have unknown probability. It designs the basic algorithms for query answering in this setting, both exact and approximate. Moreover, in a deep theoretical component, this project studies fundamental questions of data and domain complexity that are unique to open-world reasoning about big uncertain data. Finally it develops proof-of-concept applications in machine learning and data mining, and additional knowledge-representation layers that strengthen open-world reasoning. The developed semantics provide meaningful answers when some tuple probabilities are not precisely known. The developed algorithms allow for efficient query answering, even when reasoning about the open world, in time linear in the database size for tractable queries. This project provides a scientific leap at the fundamental, semantic level. It also provides a context for training undergraduate and graduate students in subjects spanning databases, artificial intelligence, theory, and machine learning, and will target the integration of probabilistic knowledge bases into computer science curricula.

View original record on NSF Award Search →