III: Small: Towards Cross-Model Query Optimizations for Multi-model Heterogeneous Data Analytics

$443,183 | FY2019 | CSE | NSF

University Of California-San Diego, La Jolla CA

Abstract

Large-scale analysis of complex, heterogeneous datasets is now an integral part of the social and natural sciences, digital journalism, law, enterprises, and numerous other application domains. Users in these fields increasingly need to perform holistic, integrated analytics that span a variety of data models, going beyond structured or semi-structured data to include graph data, text data, and more. Such multi-model data repositories are also growing in volume due to the widespread availability of online data sources such as social media and news media, which have opened up new avenues for insight in various domains. To take advantage of these opportunities, it is necessary to develop joint understanding and processing of at least three data models (relations, graphs, and text), including their evolution over time. This project aims to enable faster and more scalable cross-model data analytics.

An emerging information architecture for such heterogeneous data problems is the "polystore" approach, which uses multiple "uni-model" backend engines such as RDBMSs, graph DBMSs, Solr, etc., and provides a translation layer in the middle to farm out different parts of a cross-model query to different engines. This approach is gaining popularity because it exploits the full functionality and native performance of the uni-model engines for the corresponding parts of the queries. Among polystores, there are loosely coupled solutions with a very thin processing layer whose task is to "stitch the parts" together and which primarily support data placement, movement, and transformation. This project will instead focus on the query architecture and optimization principles for a more tightly coupled polystore. It will design a usable, efficient, and scalable data analytics platform for queries spanning three data models, viz., relations, graphs, and text (including temporal evolution), that arise from social media and other sources.
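The polystore idea described above, a middle layer that farms out parts of a cross-model query to uni-model engines and stitches the results together, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the in-memory SQLite table, the adjacency dictionary, the keyword search, and all function names are hypothetical stand-ins, not the AWESOME system or its API.

```python
import sqlite3

# Hypothetical stand-ins for three uni-model engines (NOT the AWESOME API).

def make_relational_engine():
    """Relational engine: an in-memory SQLite table of users."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?)",
                   [(1, "ana"), (2, "bo"), (3, "cy")])
    return db

# Graph engine: who follows whom, as adjacency sets.
FOLLOWS = {1: {2, 3}, 2: {3}, 3: set()}

# Text engine: one post per user id.
POSTS = {1: "polystore query planning",
         2: "weekend hiking",
         3: "graph query optimizers"}

def text_search(keyword):
    """Text sub-query: ids of users whose post contains the keyword."""
    return {uid for uid, body in POSTS.items() if keyword in body.split()}

def graph_neighbors(uid):
    """Graph sub-query: ids of users that uid follows."""
    return FOLLOWS.get(uid, set())

def relational_names(db, ids):
    """Relational sub-query: names for the given user ids, ordered by id."""
    if not ids:
        return []
    marks = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT name FROM users WHERE id IN ({marks}) ORDER BY id",
        sorted(ids))
    return [row[0] for row in rows]

def cross_model_query(db, seed_user, keyword):
    """Names of users followed by seed_user who posted about keyword."""
    matching = text_search(keyword)          # farmed out to the text engine
    followed = graph_neighbors(seed_user)    # farmed out to the graph engine
    # Stitch: intersect the partial results, then resolve names relationally.
    return relational_names(db, matching & followed)

db = make_relational_engine()
print(cross_model_query(db, 1, "query"))
```

In this sketch the order of the sub-queries is hard-coded. Deciding which engine to call first, and how much intermediate data to move between engines, is the kind of choice a cross-model optimizer would make automatically.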
A cross-model dataflow optimizer will be created for this "tri-store" setting to study fundamental systems optimization principles, and it will be implemented within the AWESOME polystore system. Further, several novel cross-model query optimization techniques will be devised to exploit the semantics of these three data models. Special attention will be paid to the temporality of data, so that the optimizations treat the temporal evolution of the data as a first-class primitive and support such queries efficiently on top of the existing engines, even though those engines may lack native support for temporal queries.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
