Moment Invariant Data Aggregation for Signal Processing and Distribution Learning

$360,000 | FY2023 | MPS | NSF

University of Utah, Salt Lake City, UT

Investigators

Abstract

The explosion of data from diverse sources in the modern era has greatly increased the need for reliable data integration tools. In many applications, one has access to vast amounts of data, but the data is highly noisy and only meaningful when aggregated. For example, in cryo-electron microscopy, one has access to a large volume of molecular images, but each image is highly noisy and reflects a random shift, orientation, and projection, which makes integration highly challenging. Another example is microarray expression data, which is generally collected in small batches with significant batch effects. The traditional statistical approach is to “standardize the data” before integration, i.e., one subtracts the mean and scales by the standard deviation in each batch and then combines the standardized batches. However, when the sample size of each batch/subpopulation is small, this approach fails because within-batch estimates of the mean and variance can be unreliable. This project will develop computationally efficient and scalable methods for aggregating noisy data from multiple sources. The methodology will be applied to data from the criminal justice system to gain insights into race- and gender-based discrepancies. The project will support several tiers of mentorship as well as graduate and undergraduate research and will contribute publicly available code. The research will also serve as a bridge between the fields of signal processing and statistics.

The project aims to develop mathematical tools for data aggregation that is invariant to the first and second moments of the data in two distinct contexts. The first context is data aggregation for multi-reference alignment (MRA), a topic motivated by biological applications such as cryo-electron microscopy. In classic MRA, one attempts to recover a hidden signal from many noisy observations, each of which has been randomly translated and corrupted by additive noise.
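The classic MRA observation model can be made concrete with a small simulation. The sketch below is illustrative only and is not the investigators' method: it assumes cyclic shifts, i.i.d. Gaussian noise with a known standard deviation, and a made-up signal (`hidden`). It uses the power spectrum, a standard translation-invariant feature, and subtracts the known noise bias after averaging. All function names are hypothetical.

```python
import cmath
import random


def dft(signal):
    """Unnormalized discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(signal):
    """Squared DFT magnitudes; unchanged by cyclic shifts of the signal."""
    return [abs(c) ** 2 for c in dft(signal)]


def observe(signal, sigma, rng):
    """One MRA observation: random cyclic shift plus i.i.d. Gaussian noise."""
    n = len(signal)
    shift = rng.randrange(n)
    return [signal[(t - shift) % n] + rng.gauss(0.0, sigma) for t in range(n)]


def estimate_power_spectrum(observations, sigma):
    """Average the observed power spectra, then subtract the noise bias.

    For an unnormalized DFT of length n, i.i.d. noise with variance sigma^2
    adds n * sigma^2 to the expected power at every frequency.
    """
    n = len(observations[0])
    m = len(observations)
    total = [0.0] * n
    for y in observations:
        for k, p in enumerate(power_spectrum(y)):
            total[k] += p
    return [s / m - n * sigma ** 2 for s in total]


rng = random.Random(0)
hidden = [0.0, 1.0, 3.0, 1.0, 0.0, -2.0, 0.0, 0.5]  # hypothetical signal
sigma = 1.0  # noise level, assumed known here
observations = [observe(hidden, sigma, rng) for _ in range(2000)]

estimate = estimate_power_spectrum(observations, sigma)
truth = power_spectrum(hidden)
```

Comparing `estimate` with `truth` shows the averaged, bias-corrected power spectrum approaching the true one even though no observation was ever aligned. The power spectrum alone does not determine the signal; full recovery needs additional invariant features, which this sketch does not attempt.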
The investigators will explore a generalization of this model where each noisy observation is also corrupted by a random scale change. The research will develop a computationally efficient method for full signal recovery that uses Fourier and wavelet-based features to correct the noisy data for the bias introduced by the random scale change. The second context is data aggregation for distribution learning. The investigators consider the scenario where data is collected from various sub-populations to produce data batches, but the sample sizes of the batches are small. Under a simple but compelling model, localization factors affect only the first and second moments of the sub-populations, so that each batch consists of independent, identically distributed observations from a shifted and rescaled copy of some universal distribution function. The goal is to recover the underlying distribution by aggregating the sparse data. This problem is highly relevant: solving it allows for reliable nonparametric estimates of the density in settings where traditional approaches fail and, more broadly, leads to more precise comparisons across sub-populations in settings where little data is available. By viewing this problem as a generalized MRA problem where uncertainty due to random sampling replaces Gaussian noise, this project contributes an innovative set of tools for its solution inspired by methods in signal processing.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
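To illustrate why the traditional standardize-then-pool approach breaks down under the batch model in the abstract, the following sketch simulates small batches and pools their standardized values. It rests on assumptions made purely for illustration: the universal distribution is standard normal, the batch shifts and scales are drawn uniformly from arbitrary ranges, and each batch has three observations. It does not implement the project's aggregation method.

```python
import math
import random
import statistics


def draw_batch(size, rng):
    """One batch: shifted and rescaled draws from a universal distribution.

    Assumption for illustration: the universal distribution is standard
    normal, and the batch shift and scale are drawn at random.
    """
    shift = rng.uniform(-5.0, 5.0)
    scale = rng.uniform(0.5, 3.0)
    return [shift + scale * rng.gauss(0.0, 1.0) for _ in range(size)]


def standardize(batch):
    """Subtract the batch mean and divide by the batch standard deviation."""
    center = statistics.mean(batch)
    spread = statistics.stdev(batch)  # sample standard deviation (n - 1)
    return [(x - center) / spread for x in batch]


rng = random.Random(1)
batch_size = 3
pooled = []
for _ in range(5000):
    pooled.extend(standardize(draw_batch(batch_size, rng)))

# Samuelson's inequality caps every standardized value in a batch of size n
# at (n - 1) / sqrt(n) in absolute value, regardless of the true distribution.
largest = max(abs(z) for z in pooled)
bound = (batch_size - 1) / math.sqrt(batch_size)

# Fraction of pooled values beyond 1.5; a standard normal has about 13% there.
tail_fraction = sum(abs(z) > 1.5 for z in pooled) / len(pooled)
```

With three observations per batch, no standardized value can exceed 2/√3 ≈ 1.15 in absolute value, so the pooled sample contains no information about the tails of the universal distribution, however many batches are combined.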
