Modelling of Graphs, Networks and Trees for Genomic Applications: High-Dimensional Model Search

$1,910,000 · FY2004 · MPS · NSF

Duke University, Durham NC

Abstract

This research concerns the development of theory, methods and computational tools for statistical modelling motivated by applications in functional genomics. The over-arching theme is that of "High-Dimensional Model Search," i.e. the development of tools and methods for generating, evaluating and interpreting sets of relevant candidate models of a biological problem involving many variables. Research includes the investigation of models and computational methods for very large-scale graphical models, especially Gaussian graphical models and very sparse model structures, as representations of the structure and patterns of association between biological variables from observational and experimental data. Gene and protein expression studies generate multiple motivating problems and data for case studies. In problems of realistic dimension (thousands of variables), any one such graphical model is high-dimensional in both the variable domain (number of nodes) and the number of parameters; the fundamental challenge, however, is that the number of candidate models for a given problem is astronomical. In fitting and evaluating candidate models, it is typical that many candidates have plausible support based on fit to the available observational or experimental data and in the context of relevant biological information. Hence the need for methods to identify, explore and evaluate many plausible models, and to interpret and characterize their similarities and differences in both statistical and substantive biological terms. Additional classes of regression models, including nonlinear statistical classification and prediction tree models, represent flexible frameworks for the exploration, combination and utilization of multiple forms of biological data in predictive phenotyping for prognosis, diagnosis and discovery.
In genomic contexts, high-throughput molecular information such as gene expression data yields a very large number of potential predictor variables to be assessed, so that, again, the challenges to statistics include dealing with and exploring spaces of very high-dimensional models. These central research challenges -- problems of variable selection and model uncertainty that, with large numbers of variables, simply cannot be addressed with current mathematical and statistical methods -- form the core focus of this project. Motivated by and integrated with a number of genomic studies that provide data, collaborators and application contexts across several areas of functional genomics, this project involves model development and innovation in methods of stochastic search over complex and very high-dimensional statistical model spaces. This includes associated model theory, methods, and computational algorithm development that involve distributed cluster-based implementations, as well as feedback application in a number of biomedical studies.
Modern biomedicine now has access to data of rapidly escalating scale and complexity, based on innovations and advances particularly in genome technologies. This includes high-throughput genomic data from DNA microarray gene expression studies in both laboratory and human observational disease and exposure studies, related large-scale molecular information from proteomic and metabolic profiling technologies, genetic and sequence information that is quickly expanding towards genome-scale sequence data, and, of course, traditional forms of clinical, environmental and demographic data. Advances in both basic biology and the use of such information to inform and aid in human health studies require very substantial advances in the capacity to analyse and interpret these data sets of ever-increasing scale and complexity.
Bringing multiple forms of such data to bear in defining predictive phenotypes is at the core of the emergent arena of clinico-genomics, for example. The challenges to modern mathematical and computational modellers are those of very high-dimensional mathematical model specification and analysis, coupled with the need for computational tools to search across astronomical numbers of candidate models, an endeavor that is well beyond the current capacity to implement, evaluate and understand. This challenge defines the core agenda of this research project: the development of statistical and computational tools for very high-dimensional statistical models. The research is coupled with an integrated collection of genomic studies that provide data, biological collaborators and applications in functional genomics in several areas of cancer and cardiovascular biology, pathway studies in cell cycle regulation and oncogenesis, cancer proteomics, transcription regulation and other areas. At the core of the research lies innovation in methods of stochastic search -- i.e., simulation techniques -- over complex and very high-dimensional statistical model spaces, and the associated model theory and methods. The inherent intellectual merit of the research lies in the advances and innovative new methods in computation and statistical modelling, as well as specific biological applications. The broader impact of the research lies in the applicability of the resulting methods and tools to applications in these specific areas, in other related biomedical/genomic applications, and in other fields of science.
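
To give a sense of the scale involved, the following Python sketch is an illustration only and is not taken from the project. It first computes how many undirected graphs exist on p labelled nodes (each of the p(p-1)/2 possible edges is present or absent, giving 2^(p(p-1)/2) graphs), and then runs a generic Metropolis-style stochastic search over subsets of variables. The function names, the toy scoring function and the "true" subset are all hypothetical, chosen only to make the example self-contained; a real analysis would replace the toy score with a model-based score such as a log marginal likelihood.

```python
import math
import random


def log10_graph_count(p):
    """log10 of the number of undirected graphs on p labelled nodes.

    Each of the p*(p-1)/2 possible edges is either present or absent,
    so there are 2**(p*(p-1)/2) graphs in total.
    """
    return (p * (p - 1) // 2) * math.log10(2)


def stochastic_search(log_score, p, n_iter=2000, seed=0):
    """Metropolis-style search over subsets of p candidate variables.

    log_score maps a frozenset of variable indices to a log-scale score
    (higher is better). Returns every visited subset with its score, so
    that many plausible models can be compared rather than just one.
    """
    rng = random.Random(seed)
    current = frozenset()
    current_score = log_score(current)
    visited = {current: current_score}
    for _ in range(n_iter):
        j = rng.randrange(p)
        proposal = current ^ {j}  # add variable j if absent, drop it if present
        if proposal not in visited:
            visited[proposal] = log_score(proposal)
        diff = visited[proposal] - current_score
        # Always accept improvements; accept worse models with probability exp(diff).
        if diff >= 0 or rng.random() < math.exp(diff):
            current, current_score = proposal, visited[proposal]
    return visited


# Hypothetical toy score: penalise each variable wrongly included or excluded.
TRUE_SET = frozenset({1, 4, 7})


def toy_log_score(subset):
    return -2.0 * len(subset ^ TRUE_SET)


visited = stochastic_search(toy_log_score, p=10)
best = max(visited, key=visited.get)
print("log10(number of graphs on 1000 nodes):", log10_graph_count(1000))
print("models visited:", len(visited), "best subset:", sorted(best))
```

With p = 1000 the number of candidate graphs already has on the order of 150,000 decimal digits, which is why exhaustive enumeration is out of the question and search has to rely on sampling. Keeping the full dictionary of visited models, rather than only the best one, reflects the point made in the abstract that many candidates typically have plausible support.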
