Bayesian Partition Models for Detecting Influential and Interactive Variables

$349,850 | FY2010 | MPS | NSF

Harvard University, Cambridge MA

Investigators

Abstract

Variable selection in regression modeling is a long-standing problem in statistics. Recently, there has been a significant surge of interest in analytically elegant, numerically robust, and algorithmically efficient variable selection methods, largely due to the tremendous advances in data collection techniques in fields such as biology, the internet, and marketing. This proposal considers the model selection problem in both univariate and multivariate high-dimensional regression problems with discrete covariates. The main goal is to develop Bayesian methodologies that enable the discovery of interactions among certain independent variables (e.g., genetic markers) affecting the response(s). By taking a Naive Bayes modeling perspective and introducing flexible latent structures to model dependence among variables, the investigator will design methods that can detect interactions among a handful of covariates out of tens of thousands of candidate predictors. By introducing "individual type" variables, the investigator outlines a strategy to decouple the modeling of the covariates from that of the responses. This strategy allows one to link a subset of covariates to a subset of the responses, which is an important goal in many biomedical studies. Strategies for extending the ideas to cases with continuous covariates will be explored and tested. This research will also bring the power of these new methods and theory to bear on several important application areas such as genetics, bioinformatics, and economic data analysis.

Selecting a subset of predictors from a large number of candidates for accurate prediction of certain outcomes (e.g., weather, stock price movement, heart disease risk, etc.) is a long-standing and challenging problem in statistics. Both general scientists and quantitative modelers now widely recognize the importance of discovering, among many candidate factors, those that are truly influential on the outcomes/responses. The proposed research is motivated by important genetics and genomics problems. Its goal is to develop statistical and computational strategies for not only selecting informative predictors but also discovering interactions among the predictors that may significantly influence the outcome. In genetics, such interactions among genetic mutations are called "epistasis," and their detection is one of the main challenges in the post-genome era. In HIV drug-resistance mutation studies, such interactions often reveal new insights into the molecular basis of the virus's drug resistance and can lead to innovative and effective treatments. The research is expected both to enrich statistical modeling theory and to provide novel computational and statistical strategies applicable to a wide range of problems in diverse fields. The preliminary tools the investigator has developed have already been successfully applied to a number of genetics and biomedical studies. These methods are also readily applicable to many "data mining" tasks, such as text mining, network studies, and e-commerce. The project will provide both educational and interdisciplinary research opportunities for graduate students, and will result in software that may be of interest to biomedical researchers, economists, and other practitioners.
