SHF: Small: Languages and Abstraction for Dynamic Big Data

$456,896 | FY2013 | CSE | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

The analysis of large datasets using computers, also known as big data analytics, is emerging as an important tool in many fields, such as science and discovery, technology, health care, and commerce. The datasets used in such applications are usually dynamic: they change over time as new data becomes available. Such changes are often small and require similarly small updates to the results, but those updates can be important, because new information can be crucial in detecting a pattern or an anomaly. For example, the Internet or a social network changes dynamically as new web pages become available, new links are added, or existing links are removed. As a result of such changes, two previously disconnected clusters of web sites can become connected by the addition of a single link, indicating, for example, an important news item or a security breach. Unfortunately, in many existing big-data systems, absorbing new information involves making one or more passes over the entire dataset. Such batch processing of dynamic data results in slow updates and in inefficient use of resources such as hardware and energy, because it unnecessarily performs many subcomputations that are unaffected by the changes. This project aims to lay the groundwork for programming languages and software systems that can support the development of such applications in the real world. The work has the potential to transform the way programmers express computations on dynamically changing big datasets, to make it possible to derive new information and knowledge from big dynamic datasets by computing with them responsively and efficiently, and to transform the way we teach the design, analysis, and implementation of computations that operate on dynamic datasets. The project also includes the development of undergraduate lectures on parallelism.
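The contrast between batch processing and a small incremental update can be illustrated with the web-link example above. The following is a minimal sketch, not the project's system: it uses a standard union-find structure, and the names `Clusters`, `add_link`, and `batch_cluster_count` are hypothetical. It handles link insertions only; link removals, which the abstract also mentions, would need a more elaborate structure.

```python
class Clusters:
    """Union-find over site names; supports link insertions only."""

    def __init__(self):
        self.parent = {}
        self.count = 0  # number of clusters currently tracked

    def find(self, site):
        # Register an unseen site as its own singleton cluster.
        if site not in self.parent:
            self.parent[site] = site
            self.count += 1
        # Walk up to the root of the cluster.
        root = site
        while self.parent[root] != root:
            root = self.parent[root]
        # Compress the path so later lookups are shorter.
        while self.parent[site] != root:
            self.parent[site], site = root, self.parent[site]
        return root

    def add_link(self, a, b):
        """Absorb one new link; return True if it merged two clusters."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        self.count -= 1
        return True


def batch_cluster_count(links):
    """Batch baseline: rebuild the clustering from the full link list."""
    clusters = Clusters()
    for a, b in links:
        clusters.add_link(a, b)
    return clusters.count


# Two disconnected clusters: {a, b, c} and {x, y}.
links = [("a", "b"), ("b", "c"), ("x", "y")]
clusters = Clusters()
for a, b in links:
    clusters.add_link(a, b)

# A single new link is absorbed without revisiting the old links.
merged = clusters.add_link("c", "x")
```

Here `merged` reports that the new link joined two previously separate clusters, which is the kind of event the abstract describes as potentially significant. The batch function reaches the same cluster count, but only by making a full pass over every link again.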
The project aims to enable users to express the dynamism in large datasets implicitly, without concerning themselves with how exactly the results will be updated when the data changes, e.g., which data depends on which other data, which data may need to be updated, and which dependencies need to be reconstructed. Starting with an implicitly dynamic program, a software system automatically and efficiently constructs a record of the computed results and updates it as the dataset changes. To achieve this goal, the project develops abstractions, programming languages, compilers, and run-time systems. Concretely, we expect three sets of contributions: (1) novel, powerful abstractions and cost models for writing programs that operate on dynamically changing large datasets; (2) programming language support, in the form of compilers and run-time systems, for realizing such abstractions on practical hardware; and (3) efficient algorithms and implementations, to be used to evaluate the proposed and future work.
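The idea of a system that records computed results and their dependencies, and then updates only what a change affects, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the abstractions or run-time system the project proposes: the classes `Cell` and `Derived` and the `runs` counter are hypothetical names, and the sketch never discards stale dependencies, which is safe but can cause unnecessary recomputation.

```python
_active = []  # stack of Derived nodes currently being evaluated


class Node:
    def __init__(self):
        self.readers = set()  # Derived nodes that have read this node

    def _record_reader(self):
        # Whoever is being evaluated right now depends on this node.
        if _active:
            self.readers.add(_active[-1])

    def _invalidate_readers(self):
        for reader in list(self.readers):
            reader.invalidate()


class Cell(Node):
    """An input value that may change over time."""

    def __init__(self, value):
        super().__init__()
        self._value = value

    def get(self):
        self._record_reader()
        return self._value

    def set(self, value):
        if value != self._value:
            self._value = value
            self._invalidate_readers()


class Derived(Node):
    """A result computed from cells or from other derived results."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn
        self.valid = False
        self._value = None
        self.runs = 0  # how many times fn has actually been executed

    def get(self):
        self._record_reader()
        if not self.valid:
            _active.append(self)
            try:
                self._value = self.fn()
            finally:
                _active.pop()
            self.valid = True
            self.runs += 1
        return self._value

    def invalidate(self):
        # Propagate only once; already-invalid nodes have done so before.
        if self.valid:
            self.valid = False
            self._invalidate_readers()


a, b, c = Cell(1), Cell(2), Cell(10)
left = Derived(lambda: a.get() + b.get())
right = Derived(lambda: c.get() * 2)
total = Derived(lambda: left.get() + right.get())

first = total.get()   # evaluates left, right, and total once each
a.set(5)              # invalidates left and total, but not right
second = total.get()  # re-runs left and total; right is reused
```

The program itself never states which result depends on which input; the dependencies are discovered as a side effect of calling `get`. After the change to `a`, `right.runs` is still 1, which mirrors the abstract's point that subcomputations unaffected by a change need not be repeated.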
