High-Dimensional Regression Modeling via Distributed Computing

$111,181 | FY2007 | MPS | NSF

Ohio State University Research Foundation -Do Not Use, Columbus OH

Investigators

Abstract

The research concerns the development of methods for analyzing complex, high-dimensional Bayesian regression models, with emphasis on the use of distributed computing to advance existing statistical methodology and to inform new ways of thinking about high-dimensional model uncertainty problems. In regression modeling, the availability of many predictor variables (the "large p" scenario) necessarily leads to questions of variable selection and model uncertainty: which predictors are most important with respect to an outcome of interest, and how does one address the reality that many different combinations of predictor variables may provide similar fit to the data? Both questions are computationally difficult, because large p problems generate enormous model spaces: p candidate predictors yield 2^p possible subsets. The recent development of computationally intensive, parallel-computing-based stochastic search methods designed to quickly explore promising regions of model space has raised new questions about both the search methods themselves and the ways in which results from such searches can inform us about the scientific problems at hand. Key components of the methodological focus of the research include (i) characterization and analysis of recently developed stochastic search methods designed to explore high-dimensional model spaces, (ii) combination of output from such search methods with theoretical results to aid in sparse regression modeling, and (iii) the development of methods that use relationships among the many possible predictor variables to aid in model search and prediction.

The impact of the research is wide-reaching: large, complex datasets have become common in many areas of science where the focus is often on determining how potentially thousands of predictor variables combine to predict an outcome of interest. A current example is clinico-genomics in cancer studies. Tumor samples from cancer patients can be used to generate data about the "activity" of upwards of tens of thousands of genes in a tumor. The research described above will allow information to be extracted from these data to help answer important questions such as "which genes work together to create particularly aggressive types of cancer?" The ability to extract such information from large, noisy datasets may lead to simple prognostic tests that inform doctors about the potential severity of cancer in a patient, which in turn can lead to better decisions about treatment options. As a second example, the research has the potential to increase the efficiency of marketing by businesses: companies that collect enormous amounts of data on the purchasing habits of their customers will be able to extract information that better informs them about which customers to target with particular products. This will lead to greater corporate efficiency and in turn contribute to economic growth.
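To make the idea of stochastic search over model space concrete, the following is a minimal single-process sketch and not the project's method. It assumes a linear model, uses BIC as a rough stand-in for a Bayesian marginal likelihood, and proposes moves by flipping one predictor in or out at a time. The function names, the simulated data, and all parameter values are hypothetical and chosen only for illustration; a distributed version would run many such searches in parallel.

```python
import numpy as np

def bic_score(X, y, gamma):
    """Return -BIC/2 for the linear model using predictors where gamma is True.

    Higher is better. BIC is used here only as an assumed approximation
    to a log marginal likelihood.
    """
    n = len(y)
    cols = np.flatnonzero(gamma)
    # Design matrix: intercept plus the currently selected predictors.
    design = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]
    return -0.5 * (n * np.log(rss / n) + k * np.log(n))

def stochastic_search(X, y, n_iter=2000, seed=0):
    """Metropolis-style search over subsets of predictors.

    Returns the best-scoring subset visited and the fraction of
    iterations in which each predictor was included.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = np.zeros(p, dtype=bool)          # start from the empty model
    score = bic_score(X, y, gamma)
    best_gamma, best_score = gamma.copy(), score
    inclusion_counts = np.zeros(p)
    for _ in range(n_iter):
        # Propose a neighbouring model: flip one predictor in or out.
        j = rng.integers(p)
        proposal = gamma.copy()
        proposal[j] = ~proposal[j]
        new_score = bic_score(X, y, proposal)
        # Accept with probability min(1, exp(new_score - score)).
        if np.log(rng.random()) < new_score - score:
            gamma, score = proposal, new_score
            if score > best_score:
                best_gamma, best_score = gamma.copy(), score
        inclusion_counts += gamma
    return best_gamma, inclusion_counts / n_iter

# Simulated data (hypothetical): only predictors 0 and 3 affect the outcome.
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

best, freq = stochastic_search(X, y)
print("best subset visited:", np.flatnonzero(best))
print("inclusion frequencies:", np.round(freq, 2))
```

Even at p = 10 there are 2^10 = 1,024 candidate subsets; the search visits only a fraction of them, and the inclusion frequencies give a rough summary of which predictors appear in well-fitting models.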
