Shrinkage Methods for Variable Selection and Structure Discovery, with Applications to High Dimensional Data

$130,000 · FY2010 · MPS · NSF

North Carolina State University, Raleigh NC

Investigators

Abstract

The proposed research addresses variable selection in the context of today's complex data structures. The first component of this project is to combine "supervised clustering" and variable selection into a single step. The goal is to facilitate the identification of important predictive clusters, leading to the discovery of an underlying grouping structure. Second, this project proposes to introduce a new technique for tuning variable selection methods that not only outperforms existing methods but also has an intuitive interpretation for non-statisticians. The third component of this project is to perform simultaneous variable selection and constrained quantile regression. In complex data, simply modeling the mean as a function of the predictors may not capture the full relationship. Quantile regression models the effect of the predictors on various percentiles of the response. The approach proposed in this project alleviates the well-known issue of crossing curves in quantile regression. Finally, in complex and high-dimensional data, outliers are almost certain to exist, so methods that are robust to them are essential. The final component of this project is an approach that combines robust methods with variable selection via a direct weighting of observations. All four components of this research will be developed within a penalization, or equivalently a constrained optimization, framework through appropriate choices of penalty.
With the abundance of information now available in all scientific fields, deciding which of the massive number of possible predictor variables to include in a model can be an overwhelming task. It is therefore essential to develop techniques for variable selection. Penalization techniques for variable selection have gained increasing popularity and are routinely applied in diverse branches of subject-matter research, including drug discovery, consumer marketing, environmental systems, financial markets, image processing, homeland security, genomics, proteomics, and metabolomics. An investigator often has several goals in mind when performing a statistical analysis. A common example occurs in gene expression studies, where one may wish to perform subject classification, gene selection, and gene clustering simultaneously. The proposed research is particularly geared toward enabling these types of multifaceted analyses. A general theme of the proposed research is that appropriately chosen penalty functions can achieve multiple statistical objectives simultaneously and in an integrated fashion. The importance of the variable selection problem across all disciplines, together with the investigator's collaborations with medical researchers and other scientists, will allow the results to be readily disseminated into the applied research community, where they can be used to improve the quality of life.
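
The abstract describes variable selection by penalization only in general terms. As a rough illustration of the idea, the sketch below fits the standard lasso (an L1 penalty) by cyclic coordinate descent. It is not the penalties or algorithms proposed in this project; the function names, the toy data, and the choice of the lasso itself are assumptions made purely for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    Assumption: the columns of X and the response y are already centered,
    so no intercept is fit.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # Mean squared value of each column, used to scale the coordinate update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - X beta; equals y while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant (all-zero) column carries no information
            # Covariance of column j with the partial residual, i.e. the
            # residual with column j's current contribution added back.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Hypothetical toy data: two centered, orthogonal predictors; the response
# depends strongly on the first and only weakly on the second.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)  # first coefficient is about 1.5; second is exactly 0.0
```

With the penalty weight set to 0.5, the weak predictor is dropped from the model while the strong one is retained in shrunken form, which is the sense in which a penalty performs selection and estimation at once. Setting `lam=0` in this sketch gives the unpenalized least-squares fit, and a sufficiently large `lam` sets every coefficient to zero.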
