Discovering What Matters: Informative and Reproducible Variable Selection with Applications to Genomics
Stanford University, Stanford CA
Abstract
This project will develop statistical methods to discover which variables in a large collection are meaningfully related to an outcome of interest. One example is identifying which genetic variants, among the millions measured, influence disease risk. The methods will analyze all variables simultaneously, accounting for their interdependence, and identify the "actionable" ones. They come with the guarantee that, on average, a large fraction of the discovered features truly influence the outcome. The ability to correctly identify important variables will increase knowledge in many domains and allow experts to devise interventions. For example, understanding which of the variables recorded on a patient are most relevant to their response to therapy can help develop personalized medical interventions with a higher success rate. The methods will enlarge the toolbox available to statisticians and data scientists as they attempt to extract meaningful information from datasets comprising a very large number of variables. The approach builds on the "knockoff" framework, a flexible and novel approach that does not require specifying a model for the relation between the outcome of interest and the candidate covariates. The inferential guarantees apply to the set of selected variables, in the form of control of the False Discovery Rate (FDR), where a discovery is considered false if the selected variable is independent of the outcome given the remaining covariates. This provides assurance on the reproducibility of results, as well as on their interpretability. The approaches developed will be used to analyze genetics datasets with the goal of obtaining more complete models of how DNA variation influences medically relevant phenotypes. This project is supported by the Division of Mathematical Sciences and the Division of Molecular and Cellular Biosciences.
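The selection step of a knockoff-style procedure can be sketched as follows. This is a minimal illustration and not the project's implementation: it assumes a vector W of per-variable statistics has already been computed from the original variables and their knockoff copies, with large positive values suggesting a real association, and it applies the standard knockoff+ threshold to choose which variables to report. The function name knockoff_select, the parameter q (the target FDR level), and the example values are hypothetical.

```python
import numpy as np


def knockoff_select(W, q=0.1):
    """Return indices of variables selected by the knockoff+ threshold.

    W : per-variable statistics (assumed precomputed); a large positive
        value favors the original variable over its knockoff copy.
    q : target false discovery rate.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics beyond -t stand in for the number of false
        # positives beyond t; the "+1" is the knockoff+ correction.
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return np.flatnonzero(W >= t)
    # No threshold meets the target level: report nothing.
    return np.array([], dtype=int)


# Hypothetical statistics for ten variables.
W_example = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, -1.0, 0.5, -0.3, 0.2]
selected = knockoff_select(W_example, q=0.2)
```

With these illustrative values, the smallest threshold whose estimated false discovery proportion is at most 0.2 is 1.5, so the first six variables are reported. Tightening the target to q = 0.1 yields no selections, since ten variables are too few to support so strict a level.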