Robust and Distributed Statistical Learning from Big Data

$600,000 · FY2017 · MPS · NSF

Princeton University, Princeton NJ

Investigators

Abstract

Big Data are ubiquitous in many areas of science, engineering, social sciences, and the humanities, and have a significant impact on technological innovation and economic development. This project seeks to introduce effective methods for robust high-dimensional statistical inference that are insensitive to the potentially poor quality of Big Data, and to develop the distributed estimation needed for Big Data analysis, computing, and optimization. The research will address several robust and distributed statistical inference problems for Big Data in genomics, genetics, neuroscience, machine learning, economics, and finance. The project will advance our understanding of molecular mechanisms, biological processes, genetic associations, brain functions, and economic and financial risk. Integration of research and education will be achieved through the involvement of undergraduate students, graduate students, and postdoctoral fellows, and through the development of publicly available computer code for robust and distributed analysis of Big Data with sound theoretical support. Working closely with industrial partners, the research will lead to increased collaboration between academia and industry.

The project will lead to the development of novel statistical theory, methods, and algorithms for robust statistical inference from high-dimensional statistics and Big Data. The first aim seeks to introduce a simple and widely applicable principle for robust inference via an appropriate shrinkage of observed data or loss functions. This reduces the influence of outliers and heavy-tailed distributions, and weakens the moment conditions from sub-Gaussian distributions to bounded second moments for regression or bounded fourth moments for covariance estimation.
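
As an illustration only, the shrinkage principle described above can be sketched for the simplest case, estimating a mean: each observation is clipped into the interval [-tau, tau] before averaging. The function names, the simulated heavy-tailed sample, and the threshold `tau = sqrt(n / log n)` are assumptions made for this sketch and are not taken from the award record; the project's actual estimators and tuning rules may differ.

```python
import math
import random
import statistics


def shrink(values, tau):
    """Clip each observation into [-tau, tau] (elementwise shrinkage)."""
    return [max(-tau, min(tau, v)) for v in values]


def shrunk_mean(values, tau):
    """Sample mean of the shrunk observations."""
    return statistics.fmean(shrink(values, tau))


random.seed(0)
n = 2000

# Hypothetical heavy-tailed sample with true mean 0: a random sign times a
# Pareto(2.5) draw, which has a finite second moment but heavy tails.
sample = [random.choice((-1.0, 1.0)) * random.paretovariate(2.5) for _ in range(n)]

# Contaminate a few observations with gross outliers.
for i in range(5):
    sample[i] = 1e4

# Assumed threshold for this sketch; in practice tau would be tuned.
tau = math.sqrt(n / math.log(n))

raw_mean = statistics.fmean(sample)
robust_mean = shrunk_mean(sample, tau)

print(f"raw mean:    {raw_mean:.3f}")
print(f"shrunk mean: {robust_mean:.3f}")
```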
The research includes plans to systematically develop the theory and methods for robust estimation of high-dimensional means, with implementation to control false discovery rates in large-scale inference for gene and transcript selection, and for robust regularization of covariance and precision matrices, with applications to robust principal component analysis, factor analysis, and high-dimensional hypothesis testing. In addition, robust sparse regression, model selection, and low-rank matrix recovery will be investigated. The second aim focuses on making the proposed robust procedures applicable to the Big Data environment via the development of distributed estimation and inference. In particular, divide-and-conquer methods will be used to distribute the computation to node machines and to address privacy and data ownership issues. Approaches that reduce the information loss due to distributed computation for likelihood-based models, via partial communication of the Hessian matrices, will be investigated. Two important classes of problems, trace regression and principal component analysis, will be used to illustrate the proposed methods.
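
The divide-and-conquer idea in the second aim can be sketched, under simplifying assumptions, as follows: each node machine fits an estimate on its own shard, and only that estimate is sent to a central machine, which averages the results. The one-parameter regression model, the function names, and the simulated shards are hypothetical choices for illustration. The sketch uses plain averaging and does not implement the Hessian-communication refinement mentioned above.

```python
import random
import statistics


def local_slope(xs, ys):
    """Least-squares slope through the origin, computed on one node's data."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def divide_and_conquer_slope(shards):
    """Average the node-level estimates; raw data never leave their node."""
    return statistics.fmean(local_slope(xs, ys) for xs, ys in shards)


random.seed(1)
true_slope = 2.0          # assumed ground truth for the simulation
num_nodes, per_node = 10, 500

# Simulate one shard of (x, y) pairs per node machine.
shards = []
for _ in range(num_nodes):
    xs = [random.gauss(0.0, 1.0) for _ in range(per_node)]
    ys = [true_slope * x + random.gauss(0.0, 1.0) for x in xs]
    shards.append((xs, ys))

estimate = divide_and_conquer_slope(shards)
print(f"averaged slope estimate: {estimate:.3f}")
```

Only one number per node crosses the network in this sketch, which is why the approach is attractive when data ownership or privacy prevents pooling the raw observations.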
