GGrantIndex
← Search

NSF-BSF: III: Small: Data Driven Schema

$500,000FY2021CSENSF

University Of Washington, Seattle WA

Investigators

Abstract

Data needs to be organized systematically and rigorously. Data about consumer products goes into one table, and data about micro-organisms goes into a different table. This makes it easier for humans and computers to store the data, to retrieve and query it, and to update it. But today one often finds large amounts of noisy, inconsistent, incomplete data, which are impossible to organize rigorously. The sheer volume of this data makes it very valuable, yet it is of limited utility without proper organization. This project develops methods for organizing noisy, inconsistent, incomplete data, and develops techniques for storing, querying, and updating such data. Its findings will inform organizations on how to organize and use large amounts of noisy data. This project develops a technique for approximate schema discovery for noisy data, for normalizing the data according to this schema, and for improving query processing. The input consists of a single, large relation, which may be noisy, inconsistent, incomplete, and the system discovers automatically a few candidate schemas that can represent the data with minimal loss and with high utility for downstream tasks. Each schema is associated with an information-theoretic score, which represents the amount of information that may be lost when we represent the data according to that schema. Then, the project researches new approaches for querying the data stored in an approximate schema, by either recording explicitly the number of "spurious tuples" generated by the schema, or by using probabilities to quantify the degree of confidence in the query's answer. The techniques explored in this project combine information theory with graph algorithms and with query optimization. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →