High-Dimensional Random Forests Learning, Inference, and Beyond

$249,999 | FY2023 | MPS | NSF

University of Southern California, Los Angeles, CA

Abstract

Random Forests are among the most widely used computational methods for making predictions. The approach works by creating a group of decision-makers, like a team of experts, and then aggregating the individual predictions of these experts to form the final prediction. The success of Random Forests has been demonstrated by their strong performance on many different types of data. Despite this success, Random Forests are still largely regarded as a black-box method because theoretical understanding of them remains limited. The complicated nature of the algorithm and the lack of theoretical understanding also make its results less reproducible and harder to interpret. The project will study the theoretical properties of Random Forests to understand when the algorithm works and, more importantly, when it fails. Such studies can give practitioners more confidence and better guidance in applying Random Forests. The project will also investigate how to improve the interpretability of Random Forests. Finally, with the understanding gained from these studies, the project will study how to improve the performance of the algorithm to make it even more useful for big data analysis. These research activities will offer numerous training opportunities for the professional development of the next generation of statisticians and data scientists.

Recently, important progress has been made in the analysis of random forest algorithms, for instance, a proof of the polynomial consistency rate of the original version of Random Forests in the high-dimensional setting, without specific assumptions on the regression function or the feature distribution. Yet many fundamentally important questions remain unanswered. The overall objective of this project is to provide an in-depth understanding of complicated ensemble methods such as Random Forests, and to provide improved, interpretable, and reproducible statistical estimation and inference results. The project will first study some important open questions about Random Forests, and then move to statistical inference. In particular, recent studies have confirmed that Random Forests can adapt to sparse models. A natural question is how to uncover the underlying true sparsity structure. Furthermore, some preliminary results suggest that popular existing feature importance methods are biased when features are collinear. The project will develop valid feature importance measures and further investigate the calculation of p-values for evaluating conditional feature importance in the presence of feature collinearity. The project will also move beyond Random Forests and study the larger problem of conditional independence testing. Using the insights gained from these theoretical studies, the project will further develop an improved ensemble learning method for better prediction, interpretability, and reproducibility in big data analysis.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
