Doubly-robust variable selection in high dimensions

$225,000 · FY2023 · MPS · NSF

University of Pennsylvania, Philadelphia, PA

Abstract

The project will develop improved methodologies for the variable selection problem, in which the goal is to choose which explanatory variables are associated with an outcome variable of interest. Variable selection is a fundamental statistical problem that arises in countless application areas: Which genes influence the prevalence of a given human disease? Which demographic and socioeconomic variables influence a person’s income? Which electronic health record entries influence future medical costs? Especially when the number of potential explanatory variables is large, it is difficult to separate the important variables (the signal) from the irrelevant ones (the noise). The variable selection problem is also computationally challenging. Both issues hamper researchers' ability to analyze their data quickly and reliably. The project will address these challenges by developing several methodological innovations and distributing the resulting improved methods in an open-source software package. The project will also provide multiple training opportunities to graduate students by involving them in the interdisciplinary research activities.

A wealth of research has gone into the high-dimensional variable selection problem and the associated conditional independence (CI) testing problem, including model-X (MX) inference, debiased lasso inference, double regression CI testing, and semiparametric/causal inference. In the context of the CI testing problem (i.e., the problem of assessing whether a single variable is associated with the response given the others), the investigator has recently led an effort to break down the barriers between these distinct lines of existing work. Moving from the CI testing problem to the variable selection problem, the project will develop a methodology satisfying an exhaustive set of concrete statistical and computational requirements, which no existing methodology satisfies in full.
First, the project will develop a procedure to construct null distributions for CI tests that are accurate in small samples (like the MX conditional randomization test) but require only a limited number of resamples (giving speed comparable to that of double regression approaches). Second, the project will develop a variable selection method that requires only a single machine learning step (like MX knockoffs) but is doubly robust (like double regression methods). Third, the project will develop and comprehensively evaluate Doubly Robust Variable Selection (DRVS), a best-of-all-worlds variable selection methodology incorporating the above two innovations.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
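For readers unfamiliar with the MX conditional randomization test that the abstract uses as a reference point, here is a minimal sketch in Python. It is not the DRVS methodology or any procedure from the project. It is the textbook resampling scheme, simplified with assumptions of our own: the conditional law of the tested variable given the others is taken to be a known Gaussian, and the test statistic is a simple covariance-type quantity. The names `crt_pvalue`, `cond_mean`, and `cond_sd` are hypothetical, as is the simulated data.

```python
import numpy as np


def crt_pvalue(x, y, z, cond_mean, cond_sd, n_resamples=500, seed=0):
    """Monte Carlo CRT p-value for H0: y is independent of x given z.

    Assumption (illustrative only): x | z ~ N(cond_mean(z), cond_sd**2)
    is known exactly, as in the model-X setting.
    """
    rng = np.random.default_rng(seed)
    mu = cond_mean(z)                 # E[x | z], shape (n,)
    y_centered = y - y.mean()

    # Observed statistic: |sum of (x residual) * (centered y)|.
    t_obs = abs(np.sum((x - mu) * y_centered))

    # Redraw x from its conditional law given z; y and z stay fixed.
    x_tilde = mu + cond_sd * rng.standard_normal((n_resamples, x.shape[0]))
    t_null = np.abs((x_tilde - mu) @ y_centered)

    # The +1 terms count the observed statistic among the resamples.
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)


# Simulated example (all values are made up for illustration).
rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=(n, 2))
a = np.array([0.5, -0.5])
x = z @ a + rng.normal(size=n)                        # x | z ~ N(z @ a, 1)
y_alt = z.sum(axis=1) + 2.0 * x + rng.normal(size=n)  # x affects y
y_null = z.sum(axis=1) + rng.normal(size=n)           # x irrelevant given z

p_alt = crt_pvalue(x, y_alt, z, lambda zz: zz @ a, 1.0)
p_null = crt_pvalue(x, y_null, z, lambda zz: zz @ a, 1.0)
print(p_alt, p_null)
```

The smallest p-value this sketch can return is 1 / (1 + `n_resamples`), so resolving very small p-values requires many resamples. That cost is the one the first aim above seeks to reduce.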
