SI2-SSE: Collaborative Research: High Performance Low Rank Approximation for Scalable Data Analytics

$387,281FY2016CSENSF

Georgia Tech Research Corporation, Atlanta GA

Investigators

Abstract

Big Data analytics is at the core of discovery covering vast areas such as medical informatics, business analytics, national security, and materials sciences. This project aims to model some of the key data analytics problems and design, verify, and deploy scalable methods for knowledge extraction. The algorithms developed will be able to handle data sets of extreme sizes and will be deployable on advanced computer hardware. The goal is to realize orders-of-magnitude improvements over existing data analytics technologies, developing algorithms that are robust to incompleteness, noise, ambiguity, and high dimension in the data. Particular focus will be parallel and distributed algorithms that can efficiently solve large problems and produce accurate solutions. The proposed research and software development will allow domain experts to tackle Big Data sets requiring large parallel systems. The improved performance will enable fast and scalable data analysis across applications, from social network analysis to study citizens' attitudes toward sustainability-related issues to computational marketing techniques that refine customers' shopping experiences. The proposed work will help bridge the gap between computational science and data analytics ecosystems, two fields that stand to make great advancements from cross-fertilization. The education and outreach plan includes graduate course creation, engagement of under-represented groups via both undergraduate and graduate research experiences, and community-building efforts by workshop and mini-symposium organization. With the advent of internet-scale data, the data mining and machine learning community has adopted Nonnegative Matrix Factorization (NMF) for performing numerous tasks such as topic modeling, background separation from video data, hyper-spectral imaging, web-scale clustering, and community detection. The goals of this proposal are to develop efficient parallel algorithms for computing nonnegative matrix and tensor factorizations (NMF and NTF) and their variants using a unified framework, and to produce a software package called Parallel Low-rank Approximation with Nonnegative Constraints (PLANCK) that delivers the high performance, flexibility, and scalability necessary to tackle the ever-growing size of today's data sets. The algorithms will be generalized to NTF problems and extend the class of algorithms we can efficiently parallelize; our software framework will allow end-users to use and extend our techniques. Rather than developing separate software for each problem domain and mathematical technique, flexibility will be achieved by characterizing nearly all of the current NMF and NTF algorithms in the context of a block coordinate descent framework. Using this framework the shared computational kernels can be separated, which usually extend run times, from the algorithm-specific computations. Finally, the usability and practicality of the proposed software will be maintained by being application driven, establishing collaborations with early end-users, and by incrementally generalizing the framework in terms of both algorithms and problems.

View original record on NSF Award Search →