Deep Learning and Random Forests for High-Dimensional Regression

$157,960FY2020MPSNSF

Princeton University, Princeton NJ

Investigators

Abstract

This project aims to investigate two of the most widely used and state-of-the-art methods for high-dimensional regression: deep neural networks and random forests. Despite their widespread implementation, pinning down their theoretical properties has eluded researchers until recently. The proposed research aims to add to the growing body of literature on their analysis, by both developing tools of theoretical value and providing guarantees and guidance for practitioners and applied scientists who use these popular methods frequently in their work. The success of multi-layer networks has largely been buoyed by their ability to generalize well despite being able to fit most datasets, given enough parameters. This phenomenon is particularly striking when the input dimension is far greater than the available sample size, as is the case with many modern applications in molecular biology, medical imaging, and astrophysics, to name a few. A major component of the proposed work will be to obtain complexity bounds for classes of deep neural networks with controls on the size of their weights, which can then be used to bound generalization error and statistical risk. These complexity bounds reveal the role of complexity penalization, which is based on certain norms of the weights of the network. Motivated by these observations, another stream of the proposed research seeks to provide statistical guarantees of certain complexity penalized estimators and their adaptive properties. Current theoretical results for random forests are either for stylized versions of those that are used in practice or are asymptotic in nature and it is therefore difficult to determine the quality of convergence as a function of the parameters of the random forest. Furthermore, the setting for the analysis of more practical implementations of random forests is limited to structured, fixed-dimensional regression function classes. Given these restrictions, the first component of the proposal aims to investigate how random forests behave in the high-dimensional regime when the number of predictors grows with the sample size. Another research objective is to isolate and study families of flexible high-dimensional regression functions for which finite sample convergence rates can be established. The final endeavor of this project is to connect popular measures of variable importance to the bias of random forests. Since variable importance measures are used for assessing the role each predictor variable plays in influencing the output, this connection will partially explain why random forests are adaptive to sparsity. The relationship will also help to theoretically motivate variable importance measures as useful tools for model interpretability. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →