ATD: Diffusion and Transport on Graphs: Active Learning, Low-Dimensional Representations, and Anomaly Detection
Tufts University, Medford MA
Investigators
Abstract
Data across scientific disciplines ranging from biology and chemistry to computer vision and image processing to geography and economics are naturally represented as graphs. For example, each node in a graph may represent an observed data point, with edges connecting similar data points. Alternatively, a single observed data point may be understood as a graph in and of itself (e.g., a geographic map or a molecule). In order to efficiently analyze such datasets, computational methods that account for latent structure and scale efficiently to graphs with large numbers of nodes are essential. Both rigorous mathematics and tractable algorithms are required to make full use of rich and varied graph data generated at terabyte scale by contemporary sensors and simulations. In this context, the investigator will develop methods for representing, analyzing, and learning functions on heterogeneous graphs. The project will develop open-source software available to the public and provide training opportunities for graduate students, particularly those from underrepresented groups. The project will focus on two interconnected settings: (i) semisupervised learning of functions on a graph via time-inhomogeneous diffusion processes that infer statistical and geometric properties from a small number of function queries; (ii) analysis of probability distributions across heterogeneous graphs via notions of optimal transport. In the context of semisupervised learning, the geometry of the graph will be combined with revealed label information via active learning, yielding time-inhomogeneous Markov diffusion matrices that succinctly characterize the data. When considering datasets consisting of heterogeneous graphs with differing numbers of nodes and connectivity properties, the framework of Gromov-Wasserstein distances will be leveraged for change and anomaly detection. The emphasis of this work will be on core problems in graph theory, high-dimensional probability and statistics, and machine learning, but many of the conjectures of interest and proposed algorithms will be directly motivated by real data science problems. Remotely sensed image processing will feature as a particular broader impact of the core theoretical and algorithmic investigations. A suite of semisupervised learning and graph analysis tools that enjoy mathematical performance guarantees will be released as open-source software. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
View original record on NSF Award Search →