CRII: III: Optimal Data Organization for Hybrid Transactional/Analytical Processing Data Systems

$174,999 | FY2019 | CSE | NSF

Trustees Of Boston University, Boston

Abstract

Scientific, commercial, and governmental applications increasingly rely on data-driven insights and decision-making that use both historical data and real-time updates. New workloads are generated by social feeds, sensor readings (a common use case in Internet-of-Things applications), and electronic micro-payments (an emerging model of e-commerce). These workloads have two things in common: (i) a very high volume of transactions and (ii) a high volume of analysis queries that need both historical and real-time data to provide useful and actionable insights. The primary challenge is that these workloads have conflicting requirements that are typically served by different data system architectures. On the one hand, we want to be able to answer analysis queries like "what was the most discussed topic in each month of the past year?" or "what is the average power consumption per neighborhood of city X?". On the other hand, we want to efficiently store incoming updates and be able to provide real-time insights like "where do we have a power network overload now?" or "what is the probability that a disaster is taking place based on the social feeds of city X?". Traditionally, data systems were engineered to efficiently support either a transactional workload (that is, quickly storing new items) or an analytical workload. Supporting the latter typically involves changing the data layout and organization, and building auxiliary indexing structures to allow for efficient data access. The emergence of complex workloads has created the need for new systems that can support hybrid transactional/analytical processing (HTAP). This research will make it possible to execute such workloads efficiently and to anticipate workload changes in a robust way. Ultimately, the project will make data ingestion and data analysis a smoother process and will enable complex applications to have their data analyzed quickly.
The researchers will build data systems that can efficiently evaluate mixed workloads by navigating the read-optimized vs. update-optimized continuum of data system architectures. The key to doing so is to vary the physical data organization and find the optimal one for each use case. Typically, data objects are physically organized in various ways between two extremes: either they follow the ingestion order, that is, the order in which they are generated or inserted into the system, or they are organized based on their value (or a specific subset of their attributes). This "structure" (also called "bounded disorder" in the literature) is treated as a continuum between the two extremes. In between, hybrid data organizations have different parts of the dataset organized with different schemes. Transactional updates add data with disorder, while answering analytical queries efficiently requires data with bounded disorder. A fundamental challenge today is to find the data organization that enables a data system to offer a tunable balance between efficient updates and fast analysis queries. This project addresses this challenge from three different angles. First, it formulates an optimization problem that can be solved at run time. Second, it formulates a robust optimization problem that will deliver good performance even when preliminary assumptions are not accurate. Third, it builds access methods that can exploit any inherently limited disorder in the underlying data to reduce the data organization effort needed for efficient analysis tasks. This research effort introduces HTAP data systems that can optimally organize data and exploit inherently bounded disorder while being robust to workload changes.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
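As an illustration of the "bounded disorder" notion in the abstract, the following minimal Python sketch quantifies how far an ingestion-ordered sequence is from fully sorted order. The two metrics (maximum displacement and number of out-of-order items) and the function names max_displacement and out_of_order_count are assumptions chosen for illustration; the abstract does not specify how the project measures disorder.

```python
from bisect import bisect_right


def max_displacement(values):
    """Largest distance between an item's arrival position and its
    position in sorted order (0 means the input is already sorted)."""
    # Arrival indices listed in sorted-by-value order; sorted() is stable,
    # so equal values keep their arrival order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    return max((abs(arrival - rank) for rank, arrival in enumerate(order)),
               default=0)


def out_of_order_count(values):
    """Minimum number of items to remove so that the remaining items are
    non-decreasing: n minus the longest non-decreasing subsequence."""
    tails = []  # tails[k] = smallest last value of a run of length k + 1
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(values) - len(tails)


# Hypothetical ingestion order: timestamps that arrive almost sorted.
nearly_sorted = [1, 2, 4, 3, 5, 7, 6, 8]
print(max_displacement(nearly_sorted), out_of_order_count(nearly_sorted))
```

When both numbers are small relative to the dataset size, the ingestion order is already close to the value order, so less reorganization work is needed before analytical queries can run efficiently. That is the kind of inherently limited disorder the third angle above aims to exploit.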
