CAREER: STATISTICAL INFERENCE FOR TOPOLOGICAL AND GEOMETRIC DATA ANALYSIS

$400,000 | FY2012 | MPS | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

The research objective of this proposal is to develop new theories and methods for estimating topological and geometric features of lower-dimensional sets based on noisy high-dimensional data. To this end, the investigator has formulated two separate but highly interdependent sets of research goals. The first set of research goals is the integration of statistical theory with methods of topological data analysis. Recent breakthroughs in computational topology have made it possible to compute topological invariants of sets from a collection of points in Euclidean spaces. Though the potential for high-dimensional statistical inference of these new types of data summaries is significant, their statistical properties are still largely unexplored. The investigator proposes 1) to develop a comprehensive theory of minimax (and adaptive) estimation of topological properties of sets and 2) to create statistical procedures for non-parametric testing and de-noising based on topological invariants. The second set of research goals pertains to the traditional geometric data-analytic task of clustering in high dimensions, and it is aimed at advancing the theory and practice of high-density clustering. Recent progress in the theory of clustering has demonstrated that clustering using density estimation can perform well in high-dimensional settings, and that the notion of high-density clustering provides a natural probabilistic framework for describing and analyzing clustering problems in great generality. Thus, the investigator intends 1) to generalize and refine the high-density clustering problem under weak conditions on the data-generating mechanism and 2) to investigate the theory and use of data resampling techniques for parameter tuning in high-density clustering and density estimation. A common thread in the proposed research is the reliance on density estimation as a tool for both accurate high-dimensional clustering and smoothing/de-noising of topological features.
In the last few decades, advances in data acquisition technologies have led to an explosion in the collection and diffusion of large-scale datasets across a variety of scientific fields. The unprecedented magnitude and complexity of modern databases pose formidable challenges to statisticians, both theoretical and methodological in nature, and have required the development of new statistical tools for data analysis. Modern high-dimensional statistics is predicated on the key assumption that, while the data are observed in a high-dimensional space, the intrinsic complexity of the data-generating mechanism is in fact significantly smaller and, therefore, learnable in computationally efficient ways. This research proposal capitalizes on this premise and describes an array of methods for summarizing, discriminating, visualizing and clustering high-dimensional noisy data and for extracting salient low-dimensional features. The proposed research encompasses several novel and open research problems at the interface of mathematics, computer science, statistics and machine learning. The procedures studied in the proposal are of broad applicability and promise to be used in a multitude of scientific areas, such as medical imaging, neuroscience, astrophysics, biology, genetics, geophysics and sensor networks, to name a few. The broader impact of this project also includes interdisciplinary training of students in statistics, mathematics and computer science.
