Discovering Sparse Covariance Structures in High Dimensions

$249,993 | FY2008 | MPS | NSF

Regents Of The University Of Michigan - Ann Arbor, Ann Arbor MI

Investigators

Abstract

This project focuses on discovering and exploiting sparse structures in the data to improve estimation of covariance matrices in high dimensions. The covariance matrix plays a key role in many data analysis methods, including principal component analysis, discriminant analysis, inference about the means in multivariate analysis, and inference about independence and conditional independence relationships in graphical models. Advances in random matrix theory have shown that the traditional estimator, the sample covariance, performs poorly in high dimensions. The existing research on alternative estimators, including previous work of the PI, focuses mostly on the situation when there is a notion of distance or ordering for the variable indices (time series, longitudinal data, spatial data, spectroscopy, etc.). However, there are many applications where such an ordering is not available: for example, genetic, financial, social, and economic data. This project develops several methods for constructing regularized sparse estimators that are invariant to variable permutations, both for the covariance matrix and its inverse. The main building blocks of the methods are thresholding, smooth penalties that encourage sparsity, permutation-invariant loss functions, adaptive weights, and manifold projections to discover potential structured re-orderings of the variables. Analytical results establishing consistency and convergence rates of the proposed estimators in high dimensions are fully developed. These theoretical results require tools that are different from standard asymptotic analysis, and few such tools are available in the existing literature. Efficient optimization algorithms needed to compute these estimators are developed, with an emphasis on keeping the growth of computational cost with dimension as slow as possible. Some of the proposed estimators carry a very low computational cost by design, while others require computational ingenuity to be feasible in very high dimensions. The proposed methodology is tested extensively, both in simulations and on a number of applications through the PI's interdisciplinary collaborations.

Massive amounts of data collected in the modern world are creating new challenges for statisticians. There is an urgent need for new theoretical and practical methods that deal with high-dimensional data, and there is a vast number of applications where high-dimensional covariance matrices need to be estimated as part of data analysis: finance, genetics, spectroscopy, remote sensing, climate studies, brain imaging, speech recognition, and many others. The PI has ongoing collaborations with chemists on Raman spectroscopy of bone, with oceanologists on using spectral data for remote ocean sensing, with climate scientists on temperature modeling, and with a biostatistician on a new type of gene expression technology that works at the protein level. The PI also works actively in the area of statistical signal processing for wireless sensor networks, where spatial covariance estimation is important and which has many security applications. The new methodology for estimating high-dimensional covariances developed in this project is analyzed theoretically and tested and validated in these applications; in turn, the directions in which the project develops at later stages are influenced by the issues and needs of the applications. The project also contributes to educating graduate students in an important area of modern statistics.
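
As a rough illustration of the thresholding building block named in the abstract, the sketch below hard-thresholds the off-diagonal entries of the sample covariance and checks that the result does not depend on the order of the variables. It is not the project's estimator: the function name `hard_threshold_covariance`, the simulated data, and the threshold level proportional to sqrt(log p / n) are all assumptions made here for illustration.

```python
import numpy as np

def hard_threshold_covariance(X, t):
    """Hypothetical helper: zero out small off-diagonal sample covariances.

    X is an (n, p) data matrix with observations in rows; t is the threshold.
    """
    S = np.cov(X, rowvar=False)            # p x p sample covariance
    T = np.where(np.abs(S) >= t, S, 0.0)   # keep only entries with |s_ij| >= t
    np.fill_diagonal(T, np.diag(S))        # always keep the variances
    return T

# Simulated high-dimensional setting (p > n) with independent variables.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))

# Assumed threshold level; the constant 2.0 is an arbitrary illustrative choice.
t = 2.0 * np.sqrt(np.log(p) / n)
est = hard_threshold_covariance(X, t)

# Permuting the variables permutes the estimate in the same way,
# because each entry is thresholded on its own.
perm = rng.permutation(p)
est_perm = hard_threshold_covariance(X[:, perm], t)
permutation_invariant = np.allclose(est_perm, est[np.ix_(perm, perm)])
```

Because the rule acts on each entry separately, it needs no notion of distance or ordering among the variables, and its cost beyond computing the sample covariance is a single pass over the p x p matrix.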
