Post Model Selection Inference and Empirical Bayes Methods

$400,000FY2010MPSNSF

University Of Pennsylvania, Philadelphia PA

Investigators

Lawrence D Browncontact Andreas Buja Linda Zhao

Abstract

Consider a standard Gaussian multiple regression model involving p independent covariates. In many applications a first step of the analysis is to reduce the data via model selection to one containing only a subset of these possible predictors. If the covariates are correlated, conventional inference based on the selected model may be invalid; for example, probabilities that confidence intervals cover the true parameter values for the selected model may be grossly overstated. The investigators propose a version of classical inference criteria and a corresponding method for guaranteeing that post selection inferences will be valid within these criteria. The inference is conservative in that it is valid independent of the model selection method that was used, and correct (though possibly conservative) marginal coverage is guaranteed for all parameter configurations. The procedure is algorithmically easy to describe. However in its optimal implementation requires numerical estimation of certain probabilities related to high dimensional Gaussian distributions, and feasible computation of these probabilities for larger values of p is an issue still under investigation. Notwithstanding certain useful asymptotic bounds can be derived, and some important special cases can be analyzed with greater precision. Conventional statistical inference requires that a model of how the data were generated be known before the data are analyzed. Yet in applications involving such common procedures as the Analysis of Variance and multiple regression it is often the case that one or more model selection procedures are first undertaken in order to help determine a model for the analysis. This model selection is then followed by statistical tests and confidence intervals computed as if the final model had been chosen in advance of examining the data. Examples abound in the social sciences, in the econometric literature, in epidemiology and in genomics. This proposal begins by examining consequences of such a practice in order to categorize the degree to which it may be misleading and misguided. Without additional care the parameters being estimated are no longer well defined, and post-model-selection sampling distributions have properties that are very different from what would be the case without model selection. Statistical inference such as confidence intervals and statistical tests does not perform as is customarily assumed. Many authors have noted some or all of these problems, but have not proposed valid general statistical inference procedures to cope with the situation. The investigators propose and study a method that produces valid statistical inference within the models selected based on the observed data. The proposed approach is universally valid, independent of the procedure that was used to select the variables to be retained in the model. Thus, from this perspective it is not necessary to investigate the details of the various model selection proposals in current use. Nevertheless, certain models and model selection procedures do yield improved performance of our confidence interval proposal, and some aspects of this will naturally be included in our research. In particular some new model selection methods based on nonparametric Bayesian ideas will be investigated both for their ability to flexibly produce satisfactory models and from the perspective of post model selection inference. Extension of these post model selection ideas will also be explored in a variety of statistical settings beyond the most common Gaussian linear models that are the initial target of this proposal.

View original record on NSF Award Search →