Collaborative Research: AF: Medium: Algorithmic High-Dimensional Robust Statistics

$600,000FY2021CSENSF

University Of California-San Diego, La Jolla CA

Investigators

Abstract

The broad task of making accurate inferences from high-dimensional and contaminated datasets is of fundamental importance and has become a key challenge in a number of pressing data-analysis applications. These include (1) data-poisoning attacks in machine learning (ML), where even a small fraction of adversarial data inserted by malicious users can substantially degrade the quality of the ML system, and (2) exploratory analysis of scientific datasets (e.g., in biology), where systematic errors can create structured corruptions that require painstaking effort to detect. To address these challenges, there is a real need to develop efficient robust learning algorithms -- methods whose performance is stable to deviations from the idealized assumptions about the input data. The precise form of these deviations is problem-specific and gives rise to various notions of robustness. The overarching goal of this project is to develop a general algorithmic theory of high-dimensional robust statistics and learning. A crucial component of the project involves building bridges between different communities, by organizing interdisciplinary workshops, and writing a new graduate textbook on the topic. Moreover, the investigators are mentoring undergraduate students and design new data-centric courses integrating research and teaching. The technical core of this project consists of two interrelated thrusts: (1) List-Decodable Learning and Mixture Models: The majority of recent literature in algorithmic high-dimensional robust learning focuses on the setting where the clean data is the majority of the dataset. List-decodable learning is a relaxed notion of learning capturing the regime where the clean data is a minority of the input dataset, and can be used to model important data-science applications, such as crowdsourcing with a majority of unreliable respondents and learning-mixture models. The project is developing a unified theory with the goal of uncovering which distributional parameters can be efficiently list-decoded, and leveraging this theory to understand the complexity of learning mixture models. (2) Robust Supervised Learning of Geometric Concepts: The goal of supervised learning is to infer a function from a collection of labeled observations. Supervised learning has traditionally been concerned with the problem of generalizing from a set of correctly labeled examples. In many realistic scenarios, a fraction of the points and/or labels may be corrupted by noise, e.g., due to sensor errors or adversarial data poisoning. Hence, it is important to develop efficient algorithms that produce accurate predictors under these conditions. The project is developing efficient robust learning algorithms for rich families of geometric concepts with respect to natural and well-studied semi-random noise models. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →