III: Small: Datalog with Aggregates: Complexity, Optimization, Evaluation

$600,000FY2023CSENSF

University Of Washington, Seattle WA

Investigators

Abstract

Computer applications are increasingly data driven. They rely on processing large datasets from which they learn models that drive the application. Machine learning on massive datasets is routinely done today by leading software companies and academic institutions, however, using ML tools effectively is still an obscure practice mostly done by ML experts. Future applications of ML will be developed by data scientists, who need friendly tools to help them manage data of massive scale. The goal of this project is to lay the foundations for building such tools. It extends relational databases that are already widely used today with the ability to perform iterations that are indispensable in machine learning applications. Relational databases are some of the best engineered systems to date, and they are used routinely to process datasets from small to massive. But the query language that they support, SQL, is only optimized for queries that do not require iteration. Yet virtually all modern data science tasks require some form of iteration. As SQL does not support iteration well, data scientists do not use SQL for most of their needs. Datalog is a query language proposed decades ago, precisely to support iteration, however, datalog does not support aggregates, such as summation or counting, which are indispensable in any data science task. This project overcomes the fundamental roadblock that prevents datalog from supporting aggregates by using a new abstraction, where standard relations are extended to relations over a semiring. This modification allows all traditional SQL optimizations to be carried over to datalog, and at the same time it allows recursion and aggregates to be interleaved freely. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →