Model-Parallel Collaborative Filtering in Apache Spark

$68,790FY2015CSENSF

University Of California-Los Angeles, Los Angeles CA

Investigators

Abstract

With data rapidly growing in size and complexity, many organizations are eager to train collaborative filtering methods on massive datasets using distributed computing environments. For instance, Netflix has hundreds of thousands of online programs to recommend to its millions of users, and Facebook has millions of users who could potentially form new links between one another. However, leading methods introduce significant algorithmic challenges in the distributed setting. The PI proposes to study a novel algorithm designed to be efficient for large-scale data science applications. Preliminary studies demonstrate the promise of this method, and the PI proposes to formally characterize the algorithm's behavior, perform an extensive empirical evaluation, and incorporate ideas inspired by this proposal into an upcoming online course PI will be teaching. Collaborative filtering, and in particular matrix factorization, is a widely used method for devising recommender systems. However, the size of these models grows linearly with the number of users and items, and leading methods for matrix factorization introduce significant challenges in the distributed setting due to their high communication costs. The PI proposes to study a novel model-parallel algorithm designed for Apache Spark that leverages the sparsity of the underlying data to drastically reduce this communication burden. Preliminary studies demonstrate the promise of this method, and the PI proposes to formally characterize the algorithm's behavior, perform an extensive empirical evaluation, and explore the paradigm of model-parallelism in Spark more generally for other learning settings. The PI will also incorporate ideas related to model-parallelism inspired by this proposal into an upcoming MOOC that be taught on the edX platform.

View original record on NSF Award Search →