Efficient Distribution Classification Tasks via Optimal Transport Embeddings

$87,406 | FY2022 | MPS | NSF

University of North Carolina at Chapel Hill, Chapel Hill, NC

Abstract

Classification allows one to organize data based on similarities and can provide insight into underlying relationships in a large variety of fields, including cancer research, survey analysis, and image and text processing. As a result, the development of efficient algorithms for classification tasks is an important research area. One approach, machine learning, has proved successful in classification tasks, but it is usually focused on data points in vector spaces. In many applications, however, instances of data are naturally interpreted as entire point clouds, or as distributions, and do not lie in a vector space. Furthermore, the high dimension of such datasets leads to theoretical and computational challenges. This project is devoted to the development of classification algorithms for high-dimensional datasets consisting of distributions, and will focus both on their theoretical analysis and computational efficiency. To this end, the principal investigator will use the framework of optimal transport, which provides a natural way of comparing distributions. Students will be involved and trained in interdisciplinary aspects of this project.

This project applies knowledge from computational optimal transport, such as linear embeddings and regularized optimization, and machine learning algorithms, to study classification tasks for datasets consisting of distributions. The main goal is to develop approximation methods with guaranteed error bounds that also allow for algorithmic insights and efficient implementation. Open problems on approximation power, computational feasibility, and numerical analysis will be addressed. Specifically, the project addresses four fundamental questions that arise in the field: (1) What are the types of distributions that can be classified with traditional machine learning techniques through linear embeddings, and how does the choice of a regularizer affect accuracy? (2) How well can the Wasserstein distance and Wasserstein barycenters be approximated through linear embeddings using Euclidean distances? (3) Under which conditions can we guarantee separability with simple classifiers in the embedding space for disjoint classes of distributions? (4) How can we tailor our framework to address various applications, such as classifying structures in audio or video segments?

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
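As a rough illustration of question (2) in the abstract, the one-dimensional case offers a setting where a linear embedding reproduces the Wasserstein distance almost exactly: the 2-Wasserstein distance between two distributions on the real line equals the L2 distance between their quantile functions. The sketch below is not taken from the project itself. The helper names `lot_embed_1d` and `w2_exact_1d`, the quantile grid size, and the Gaussian test data are all assumptions chosen for illustration.

```python
import numpy as np

def lot_embed_1d(samples, n_quantiles=200):
    """Hypothetical helper: embed a 1-D sample set as a vector of quantiles."""
    # Midpoint levels in (0, 1) avoid the noisy extreme quantiles.
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # Dividing by sqrt(n_quantiles) makes the Euclidean norm of the vector
    # approximate the L2 norm of the quantile function on [0, 1].
    return np.quantile(samples, levels) / np.sqrt(n_quantiles)

def w2_exact_1d(x, y):
    """W2 between two equal-size 1-D empirical distributions, via sorting."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

# Illustrative data (an assumption): samples from N(0, 1) and N(3, 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)
y = rng.normal(3.0, 1.0, size=2000)

# Euclidean distance in the embedding space versus the sorted-sample W2.
w2_embed = np.linalg.norm(lot_embed_1d(x) - lot_embed_1d(y))
w2_sorted = w2_exact_1d(x, y)
print(f"embedding distance: {w2_embed:.3f}, sorted-sample W2: {w2_sorted:.3f}")
```

Once each distribution is represented by such a vector, standard vector-space classifiers can be applied to the embedded data. In higher dimensions the embedding is generally only an approximation of the Wasserstein distance, which is the regime the abstract's questions target.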
