Collaborative Research: Consistent Risk Estimation under High-Dimensional Asymptotics

$69,999 · FY2018 · MPS · NSF

CUNY Baruch College, New York, NY

Investigators

Abstract

Learning from large datasets has been the cornerstone of modern innovations and discoveries in science, medicine, and technology. Accurate prediction of unseen events is a canonical goal in statistical learning. A classic approach to this end is leave-one-out cross-validation, a time-consuming routine that repeatedly leaves one datum out, fits the model on the rest, and tests it on the left-out datum. The recent emergence of massive data has exacerbated the computational infeasibility of such approaches. Moreover, in many recent instances, the number of features per observation can be extremely large, adding another challenging facet to the fast estimation of prediction error. To overcome these problems, a new set of scalable and consistent risk estimators will be developed in this project.
The importance of risk estimation has motivated the development of different schemes, such as cross-validation, Stein's unbiased risk estimation (SURE), generalized cross-validation (GCV), the Akaike Information Criterion (AIC), and the bootstrap. The emergence of high-dimensional datasets has challenged most classical approaches to risk estimation. For instance, the large discrepancy between in-sample and out-of-sample prediction error, in applications involving predictions based on previously unseen features, makes it hard to rely on popular estimators, such as SURE or AIC, in high-dimensional regimes where the number of predictors is larger than or of the same order as the number of observations. On the other hand, the high information value of a datum in these regimes (as opposed to the information value of a datum in low-dimensional settings) casts doubt on the reliability of other techniques, such as 5-fold cross-validation, which withhold a sizable fraction of the data from each fit. The project offers a novel theoretical framework to find the middle ground between scalability and reliability, and specifically, to obtain theoretically consistent and computationally efficient risk-estimation schemes under high-dimensional settings.
Since risk estimation is at the core of areas including, but not limited to, machine learning, signal processing, medical imaging, neuroscience, and social and environmental sciences, any success in this project will lead to reliable and immediate scientific discoveries and better learning systems. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
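
The leave-one-out routine described in the abstract can be sketched in a few lines. The example below is only an illustration and is not one of the estimators proposed in this project: it assumes ridge regression with a fixed penalty and no intercept, uses simulated data, and all function and variable names are hypothetical. It contrasts the brute-force routine (refit once per observation) with the classical closed-form shortcut available for linear smoothers, which recovers the same estimate from a single fit.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def loo_brute_force(X, y, lam):
    """Leave-one-out squared-error estimate obtained by refitting n times."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # drop observation i
        beta = ridge_fit(X[keep], y[keep], lam)  # fit on the rest
        errors[i] = (y[i] - X[i] @ beta) ** 2    # test on the left-out datum
    return errors.mean()

def loo_shortcut(X, y, lam):
    """Same estimate from one fit, using the diagonal of the hat matrix."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    residuals = y - H @ y
    # For ridge with fixed lam, the i-th leave-one-out residual
    # equals residual_i / (1 - H_ii).
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)

# Simulated data (an assumption for illustration), with the number of
# predictors of the same order as the number of observations.
rng = np.random.default_rng(0)
n, p, lam = 60, 40, 1.0
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

loo_slow = loo_brute_force(X, y, lam)
loo_fast = loo_shortcut(X, y, lam)
in_sample = np.mean((y - X @ ridge_fit(X, y, lam)) ** 2)
print(loo_slow, loo_fast, in_sample)
```

The two leave-one-out values agree up to floating-point error, while the in-sample error is smaller than both, which illustrates the gap between in-sample and out-of-sample error mentioned in the abstract. The shortcut is exact only for estimators of this linear-smoother form; it is shown here to convey why scalable alternatives to refitting are attractive, under the stated assumptions.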
