
Novel p-Value Based Multiple Testing Methods for Variable Selection with False Discovery Rate Control

$269,695 · FY2022 · MPS · NSF

Temple University, Philadelphia PA

Abstract

Multiple testing is one of the most common statistical challenges encountered in modern scientific investigations. This project aims to resolve some longstanding issues in the application of multiple testing methods. One such issue arises when discovering, among a large collection of variables, those that have an important influence on an outcome of interest: standard multiple testing methods are often inapplicable because the interdependency of the variables is unknown. The methods under development aim to provide new approaches to discovering important variables no matter how the variables depend on each other, with the guarantee that, on average, only a small, controlled fraction of the discoveries are unimportant variables, that is, false discoveries. An example application is the identification of genetic variants that, among many thousands, can influence a certain disease. The new methods can help identify genes that are relatively more relevant for therapeutic intervention. The fundamental theoretical and methodological ideas behind these methods will be extended to resolve similar issues with multiple testing methods in other experimental settings. The research carried out in the project will be incorporated into courses, benefiting the training of undergraduate and graduate students.

This research project focuses on important theoretical and methodological issues related to multiple testing. For instance, feature/variable selection in multiple linear regression with Gaussian noise, which plays an important role in data science and is a ubiquitous statistical framework in scientific investigations, is often framed as a multiple testing problem. The ideal would be a p-value based multiple testing method that, whatever error rate is chosen to control falsely discovered explanatory variables, captures the correlation matrix of the explanatory variables in full without losing control over that error rate. Unfortunately, such methods are yet to be developed in a non-asymptotic setting. Similarly, for the related problem of simultaneously testing multivariate Gaussian means with a non-diagonal correlation matrix, subject to control of an error rate, a p-value based multiple testing method that fully captures the correlation information without losing control over that rate is largely absent from the literature. These challenges will be met by research that cross-fertilizes two seminal ideas in multiple inference: 1) the use of p-value based multiple testing methods to control false discoveries; and 2) the use of a knockoff of the design matrix for variable selection in linear regression settings. Concretely, the project aims to develop novel p-value based multiple testing methods that control the false discovery rate and other powerful error rates for 1) variable selection in multiple linear regression with Gaussian noise, in both low- and high-dimensional settings; and 2) simultaneous testing of multivariate Gaussian means with a general non-diagonal covariance matrix.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
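As background for the first of the two ideas named in the abstract, the sketch below shows the classical Benjamini-Hochberg step-up procedure, a standard example of a p-value based method for false discovery rate control. It is not the method proposed in this project. The function name `benjamini_hochberg` and the default level `alpha=0.1` are illustrative assumptions. The procedure's guarantee is known to hold when the p-values are independent or positively dependent, which illustrates why arbitrary correlation among variables is a difficulty.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Classical Benjamini-Hochberg step-up procedure (illustrative sketch).

    Returns a list of booleans, one per hypothesis, marking rejections.
    The name and the default alpha are assumptions made for this example.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Step-up rule: reject every hypothesis ranked at or below k.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, made up purely for illustration.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example, alpha=0.05)
```

Because the rule is step-up, a hypothesis can be rejected even when its own p-value exceeds its own threshold, as long as some larger p-value falls below the threshold for its rank.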
