CAREER: Scalable and Robust Uncertainty Quantification using Subsampling Markov Chain Monte Carlo Algorithms

$360,524 | FY2024 | CSE | NSF

Trustees Of Boston University, Boston

Abstract

When trying to understand the workings of complex systems (whether individual cells or whole ecosystems), large datasets have the potential to provide deep scientific and operational insights. However, two major challenges must be addressed: how to process such large datasets quickly yet rigorously, and how to avoid becoming overconfident in the conclusions reached, given the limitations of both the data and our knowledge of how such systems work. This research develops a comprehensive framework and set of algorithms for addressing both challenges in a general way, so that scientists and other data analysts can use them off the shelf, thereby accelerating the acquisition of new knowledge. The work will be developed in the context of two modern application areas of broad interest. The first is to enable biologists to learn about the inner workings of systems that are difficult or impossible to observe directly (such as the internal functioning of cells or the evolutionary history of animal species). The second is to enable ecologists to predict how ecosystems change over periods ranging from months to decades, thereby enabling better management of ecosystems and better deployment of ecological monitoring efforts. The investigator is working directly with experts in these applications to have an immediate and substantive impact in both areas. In one educational component of the project, the investigator is a core member of the team developing new, modern introductory applied statistics courses for undergraduate students at Boston University. The investigator is also writing an accessible textbook on the design and analysis of algorithms for data science, which will be of broad interest to students and researchers in machine learning, data science, statistics, and related fields.
Despite many empirical successes, a lack of machine-learning methods with rigorous guarantees has resulted in systems that unpredictably perform poorly in real-world settings and therefore cannot be trusted for scientific discovery and safety-critical applications. Hence, there is an urgent need to create learning algorithms that are simultaneously scalable to the large datasets and high-dimensional models typical of machine-learning applications; able to accurately quantify uncertainty to ensure correct decision-making despite model misspecification, distribution shift, and data corruption; and reliable and easy to use for the typical machine-learning practitioner. The primary technical objective of the project is to provide a comprehensive solution to these challenges by developing provably correct subsampling Markov chain Monte Carlo (MCMC) algorithms with automated tuning procedures. The key technical tool is a statistical-scaling-limits approach to establishing the statistical and algorithmic foundations for tuning both basic subsampling MCMC algorithms designed for inference in latent variable and Gaussian process models and modified subsampling MCMC algorithms that can improve computational efficiency and numerical stability. These theoretical developments will be translated into practical, user-friendly algorithms with diagnostics that inform the user whether the theory is applicable to their problem. The theory and algorithms will also be extended to distributionally robust losses such as maximum mean discrepancy. The research program is highly interdisciplinary, drawing on theory and methods from large-scale probabilistic machine learning, statistics, stochastic analysis, stochastic process theory, and numerical analysis. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
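For readers unfamiliar with the term, the idea behind subsampling MCMC can be illustrated with a minimal sketch. The example below uses stochastic gradient Langevin dynamics (SGLD), one widely known subsampling MCMC algorithm; the abstract does not say which algorithms the project studies, so this choice is an assumption made purely for illustration. The toy model (observations x_i ~ N(theta, 1) with a N(0, 100) prior on theta), the function names, and all parameter values are likewise hypothetical.

```python
import math
import random


def sgld_gaussian_mean(data, batch_size, step_size, n_steps, prior_var=100.0, seed=0):
    """Approximate posterior samples of theta for x_i ~ N(theta, 1) via SGLD."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        # Subsample a minibatch (with replacement) instead of using all n points.
        batch = [data[rng.randrange(n)] for _ in range(batch_size)]
        # Unbiased minibatch estimate of the gradient of the log posterior.
        grad_prior = -theta / prior_var
        grad_lik = (n / batch_size) * sum(x - theta for x in batch)
        # Langevin update: half a step along the gradient plus injected Gaussian noise.
        noise = math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


def exact_posterior(data, prior_var=100.0):
    """Closed-form posterior mean and variance for the same conjugate toy model."""
    precision = len(data) + 1.0 / prior_var
    return sum(data) / precision, 1.0 / precision


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
    # Discard the first 500 iterations as burn-in.
    draws = sgld_gaussian_mean(data, batch_size=50, step_size=1e-4, n_steps=5000)[500:]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print("SGLD  mean, variance:", mean, var)
    print("Exact mean, variance:", *exact_posterior(data))
```

Each iteration touches only 50 of the 1,000 observations, which is where the scalability comes from. In this sketch the sampled variance will generally differ from the exact posterior variance, because the minibatch gradient adds noise whose size depends on the step size and batch size. That sensitivity to tuning is the kind of issue the automated tuning procedures and diagnostics described above are meant to address.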
