CAREER: Scalable methods for discovering multivariate dependencies in high dimensional data.

$400,004 · FY2014 · MPS · NSF

Stanford University, Stanford CA

Abstract

This proposal aims to develop principled methods for discovering multivariate dependencies in ultra-high-dimensional settings. A common theme uniting the proposed methods is scalability and the identification of their limitations. Penalized likelihood methods are a popular approach to estimating sparse inverse covariance matrices. As the first research component, we propose a novel approach to solving the penalized Gaussian log-likelihood problem that is faster than its competitors by many orders of magnitude. The second research component investigates the statistical properties of thresholded matrices in finite samples, with a view to obtaining a highly scalable covariance estimation method that guarantees positive definiteness. The third research component addresses quantifying the variability and uncertainty of estimated graphical network models. It introduces a methodology that takes advantage of a convex pseudo-likelihood formulation of the graphical model selection problem, which allows for the development of a highly scalable uncertainty quantification method with theoretical safeguards. The fourth research component applies the methodology developed in the previous three components to a problem in the area of climate change, where high-dimensional covariance estimation is required. The proposal also has a significant teaching and outreach component, which aims to introduce statistics to aspiring young scientists at various stages of their undergraduate and graduate studies.

The availability of high-throughput data from various applications, including genomics and the environmental sciences, has created an urgent need for methodology and tools for analyzing high-dimensional data. Extracting and making sense of the many complex relationships and multivariate dependencies in such data, and developing principled inferential procedures, is one of the major challenges facing statisticians and data scientists.
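To illustrate why positive definiteness is a concern for thresholded covariance estimates, as in the second research component above, the following Python sketch hard-thresholds a small matrix and checks positive definiteness with an attempted Cholesky factorization. It is only a toy illustration under stated assumptions, not the method proposed in this project: the 3 x 3 matrix, the threshold of 0.6, and the function names `hard_threshold` and `is_positive_definite` are all hypothetical and chosen for the example.

```python
import math


def hard_threshold(cov, t):
    """Zero out off-diagonal entries whose absolute value is below t."""
    p = len(cov)
    return [
        [cov[i][j] if i == j or abs(cov[i][j]) >= t else 0.0 for j in range(p)]
        for i in range(p)
    ]


def is_positive_definite(a):
    """Return True if the symmetric matrix a admits a Cholesky factorization."""
    p = len(a)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:  # a non-positive pivot means a is not positive definite
                    return False
                chol[i][i] = math.sqrt(d)
            else:
                chol[i][j] = (a[i][j] - s) / chol[j][j]
    return True


# Hypothetical positive definite "sample covariance" (illustrative values only).
sample_cov = [
    [1.00, 0.75, 0.50],
    [0.75, 1.00, 0.75],
    [0.50, 0.75, 1.00],
]

thresholded = hard_threshold(sample_cov, 0.6)

print(is_positive_definite(sample_cov))   # input matrix
print(is_positive_definite(thresholded))  # after hard thresholding
```

In this toy case the input matrix is positive definite, but zeroing the single entry of magnitude 0.5 yields a matrix with negative determinant, so the thresholded estimate is no longer a valid covariance matrix. The sketch uses an exact sign test on the Cholesky pivots; a practical implementation would likely need a numerical tolerance.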
The theoretical and methodological work proposed in this project is motivated by applications and interdisciplinary collaborations in fields as diverse as the earth and environmental sciences, genomics and cancer research, and the social sciences. In genomics, for instance, one is often interested in knowing how various genes are associated, and how these associations differ between an experimental (diseased) group and a control group. Gene regulatory networks also serve as important tools for studying the evolution of diseases. In the context of the climate change debate, modeling temperature at different points on the globe requires a parsimonious description of how these variables are related. Modeling correlations also arises naturally in materials science and engineering, where one is interested in how different atomic particles interact when new materials are produced. Hence the proposed methods for estimating correlations in very high-dimensional settings will have widespread applications, since understanding associations between many variables is an endeavor common to many scientific disciplines. The proposed work, though firmly rooted in the statistical sciences, is very much interdisciplinary, and involves collaborations and partnerships between statisticians and data scientists on the one hand and biomedical scientists, engineers, and earth scientists on the other.
