Density-Preserving Maps

$180,000 | FY2009 | MPS | NSF

Georgia Tech Research Corporation, Atlanta GA

Investigators

Abstract

This project will investigate two aspects of high-dimensional statistics. First, it develops a new paradigm for nonlinear dimension reduction (often called manifold learning) in which, instead of preserving local distances in the original space as existing approaches do, the method preserves the densities in the original space. The motivation is twofold. Using results from Riemannian geometry, the investigators have shown that it is not possible in general to preserve distances, whereas it is always possible to preserve densities. In addition, because perhaps the most common scientific use of nonlinear dimension reduction methods is to visualize clusters and outliers, which are arguably best described formally in terms of densities, it can be argued that this approach directly preserves the actual information of interest. This is achieved by means of novel formulations resulting in least-squares problems, as shown in preliminary work, or in convex optimization problems to be developed. Second, the project develops theory and methodology for nonparametrically estimating the densities of points lying on a submanifold, which is needed as the first step in the overall approach. This includes asymptotic results that depend on the dimension of the submanifold rather than on that of the ambient space, as existing results do. This contrasts with the popular conclusion that nonparametric estimation in high-dimensional spaces is simply intractable. Theoretical, methodological, and experimental development will be performed. Very high-dimensional data, such as text documents, images, or astronomical spectra as typically encoded, have become increasingly important and prevalent, while statistical theory and methods have only recently attacked such problems with full vigor. Such data are critical for homeland security, medicine, remote sensing of the environment, e-commerce, and a host of other domains.
The intellectual merit of the work is the introduction of a new way of formulating and analyzing two fundamental statistical operations on such data: dimension reduction and density estimation. Each of these could open the door to new avenues in the much-needed area of very high-dimensional statistics. The broader impact of the work is a transformative improvement in the ability of analysts to reliably identify outliers and clusters in high-dimensional data; for example, such a tool could help astronomers identify new types of astrophysical objects. The work will be distributed as part of a widely distributed, state-of-the-art toolbox of statistical methods to maximize impact across many areas of data analysis.
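
The idea behind the second aim, density estimation that depends on the dimension of the submanifold rather than on that of the ambient space, can be illustrated with a minimal sketch. This is not the investigators' estimator: it assumes a Gaussian kernel on ambient Euclidean distances, a fixed bandwidth h, and a known intrinsic dimension d, and the function name submanifold_kde is hypothetical. The only change from a standard kernel density estimate is that the normalizing constant uses d in place of the ambient dimension.

```python
import numpy as np


def submanifold_kde(query, data, h, d):
    """Gaussian kernel density estimate normalized for intrinsic dimension d.

    Illustrative assumption: distances are measured in the ambient space,
    but the kernel is normalized as if it lived in d dimensions.
    """
    query = np.atleast_2d(query)
    # Pairwise squared Euclidean distances between query and data points.
    sq_dist = ((query[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    # Normalizing constant of a d-dimensional Gaussian with scale h.
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * h ** 2)).sum(axis=1) / (len(data) * norm)


# Toy data: points uniform on the unit circle (intrinsic dimension 1),
# embedded in a 10-dimensional ambient space.
rng = np.random.default_rng(0)
n_points, ambient_dim = 20000, 10
theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
data = np.zeros((n_points, ambient_dim))
data[:, 0] = np.cos(theta)
data[:, 1] = np.sin(theta)

query = np.zeros((1, ambient_dim))
query[0, 0] = 1.0  # a point on the circle

# Density with respect to arc length is 1 / (2 * pi) everywhere on the circle.
true_density = 1.0 / (2.0 * np.pi)
intrinsic_estimate = submanifold_kde(query, data, h=0.1, d=1)[0]
ambient_estimate = submanifold_kde(query, data, h=0.1, d=ambient_dim)[0]

print("true density:        ", true_density)
print("intrinsic-d estimate:", intrinsic_estimate)
print("ambient-d estimate:  ", ambient_estimate)
```

In this toy setting the estimate normalized with d = 1 should land close to the true density on the circle, while the same kernel normalized with the ambient dimension is off by many orders of magnitude, which is the practical reason the intrinsic dimension matters.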
