SPX: Collaborative Research: Multicore to Wide Area Analytics on Streaming Data
Carnegie Mellon University, Pittsburgh PA
Investigators
Abstract
In today's big data era, there is an urgent need for methods that can quickly derive analytical insights from large volumes of continuously generated data. Such streaming data include video, audio, activity logs, and sensor data, and are generated on a massive scale all over the world. The need for real-time streaming analytics can only be fulfilled with the help of appropriately designed parallel and distributed algorithms. However, parallel and distributed computing systems come in a variety of shapes and sizes, and algorithms should be designed to match the characteristics of the underlying system. This project develops methods for analyzing massive streaming data on computing systems ranging from machines with multiple cores sharing memory to geo-distributed data centers communicating over wide-area networks. The results of this research are expected to improve the efficiency, latency, and throughput of streaming analytics. Due to the foundational nature of the analytical tasks considered, results of this project will impact disciplines that use large-scale machine learning and graph analytics, including cybersecurity, social network analysis, and transportation. Resulting software will be released as toolkits on stream processing platforms and deployed in a smart-city camera infrastructure. Synergy between the research goals and the teaching goals of the PIs will lead to new instructional material in existing courses as well as the development of new courses in data analytics. Individuals from underrepresented groups will be included as part of the project. The project will benefit from and strengthen collaborations between academia, industry, and national labs on streaming analytics.
The first technical thrust of the project is on designing shared-memory parallel algorithms for computation on data streams that can achieve high throughput and fast convergence on complex analytics tasks. The second thrust is on designing distributed streaming algorithms that can tolerate variable communication delays and adapt to the available bandwidth in a wide-area network by identifying good tradeoffs between the freshness of results and the volume of communication. These advances will be studied in the context of fundamental graph analytics and machine learning tasks such as subgraph counting, graph connectivity and clustering, matrix factorization, and deep networks. The project will apply the vast body of theory and techniques developed in parallel computing to the design of methods for processing streaming data, leading to a toolkit of techniques that can be reused across applications. The project will also lead to advances in sequential streaming and incremental algorithms for certain problems, for instance, problems in machine learning that use iterative convergent methods. Based on the techniques designed, the project will design and build a hierarchical parameter server that operates effectively across the spectrum from multicore machines to data centers to wide-area data sources.