Towards Harmonic Analysis in Wasserstein Space: Low-Dimensional Structures, Learning, and Algorithms

$370,008 · FY2023 · MPS · NSF

Tufts University, Medford MA

Investigators

Abstract

Developing efficient methods for statistical and computational data analysis is an essential task in the 21st century. This project will leverage structure in large, high-dimensional data, an approach with the potential to transform a range of scientific areas, including image processing, natural language processing, and computational social sciences. Data will be modeled as probability measures in order to develop tractable and data-efficient machine learning methods that crucially leverage intrinsic geometric properties of the data (e.g., shapes in images and semantics in text documents). The focus is on theoretical foundations and scalable algorithms that compete with state-of-the-art black-box methods while retaining a high degree of interpretability. Beyond the core focus on mathematics, data science, and machine learning, these frameworks and algorithms have immediate applications to geoscience and geography. The new data analysis tools will allow the investigators to address long-standing open problems in these fields, which in turn will provide important sources of validation data. A major component of this project is training both PhD students and undergraduates in research at the intersection of mathematics, data science, and computing.

Specifically, the project addresses fundamental problems of (1) modeling with and computing in intrinsically low-dimensional sets of probability measures and (2) the learnability and expressivity of these low-dimensional sets. The investigators treat data as measures in Wasserstein space and propose to develop analogues of low-dimensional models from the classical vector space setting, allowing for efficient synthesis and analysis of observed measures with respect to a set of reference distributions. The work centers on two major problems pertaining to low-dimensional structures in Wasserstein space.

First, what is the "correct" model for intrinsically low-dimensional subsets of Wasserstein space, and how can such subsets be computed? The investigators propose three models that mimic notions of low-dimensional subspaces (exact and approximate) by leveraging the geometric properties of Wasserstein space as well as recent computational advances in entropic regularization. The goal for these three models is to efficiently encode (analyze) measures by their coefficients in these low-dimensional subsets of Wasserstein space, and to decode (synthesize) measures from those coefficients. Efficiency is considered from both a statistical perspective (e.g., estimation given samples from the measures) and a computational perspective (e.g., designing algorithms with sub-cubic complexity).

Second, what are the efficient, computable, and expressive representational systems in Wasserstein space? The investigators propose to tackle the data-driven problem of identifying and learning, from observed probability measures, a small number of reference measures that can represent them efficiently. In parallel, the investigators will study the fundamental problem of which systems are expressive enough to represent typical elements of Wasserstein space. This provides the crucial connection between the computational algorithms and an embryonic theory of harmonic analysis in Wasserstein space.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
