Next-generation Tempering Methods for Multimodal Sampling: Theory and Applications
Texas A&M University, College Station TX
Investigators
Abstract
The data we collect today is more complex than ever, requiring sophisticated statistical models to uncover intricate patterns and extract useful information. These models come in various forms, such as neural network models for image and speech recognition, mixture models for text classification, and graphical models for studying the interaction between genes. But the increasing volume and complexity of the data make it infeasible to perform exact calculations with such models. A viable alternative used ubiquitously in data science is sampling, which can generate approximate, randomized, and computationally efficient solutions. For example, given a data set of thousands of genes, by generating random gene networks repeatedly according to some sampling scheme, one can determine which network best explains the observed data and quantify the probabilities of gene interactions. However, the performance of sampling methods can vary significantly across different problems and is often hard to analyze theoretically. In this project, a paradigm-shifting sampling methodology with theoretical guarantees will be developed. The methods developed are widely applicable and are particularly useful when a large number of good solutions to the problem exist, but only a few can be identified by traditional sampling algorithms. The research team will implement the methods for solving computationally intensive problems in genomics, such as heritability estimation of complex traits, and for studying the water use efficiency of different cotton varieties. The latter will help agronomists develop and select high-yielding cotton varieties with better drought tolerance. The research includes projects suitable for training graduate students and open-source software development.
This project aims to develop new algorithms for sampling from multimodal target distributions, a universal computational challenge in statistics, machine learning, and applied sciences. A novel approach to devising Markov chain Monte Carlo methods will be developed, which bridges existing sampling techniques, including random walk Metropolis-Hastings algorithms, importance sampling, and simulated tempering. This new paradigm enables researchers to combine different techniques conveniently, leading to both theoretical and methodological advancements that are impossible in the classical framework. In particular, the research team will develop new algorithms that involve multiple chains, with each chain's behavior and computational cost optimized according to whether it is used for exploration or exploitation. Convergence analysis will be conducted under general multimodal settings to provide theoretical guarantees. One application of the proposed sampling methodology that will be researched in depth is the learning of partial differential equation models. A novel Bayesian methodology based on Gaussian processes will be developed for inferring unknown parameters in highly non-linear partial differential equation models, which can be seamlessly integrated with the proposed sampling methods. Other applications, including high-dimensional model selection problems, will also be studied, and algorithms tailored to each problem will be developed. All resulting software packages will be made publicly available. Ultimately, the statistical methodology and algorithms developed in this project will be utilized to tackle pressing problems in genomics and crop sciences.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
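To make the multiple-chain tempering idea in the abstract concrete, the following is a minimal sketch of classical parallel tempering built on random walk Metropolis-Hastings moves. It is a textbook baseline, not the methodology proposed in this project. The bimodal target (an equal mixture of two unit-variance Gaussians centered at -4 and 4), the temperature ladder, the step-size scaling, and the names `log_target` and `parallel_tempering` are all illustrative assumptions.

```python
import math
import random


def log_target(x):
    """Log-density, up to a constant, of an assumed toy target:
    an equal mixture of N(-4, 1) and N(4, 1)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    # Log-sum-exp for numerical stability.
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def parallel_tempering(n_iter, betas, step=1.0, seed=0):
    """Run one random walk Metropolis chain per inverse temperature.

    betas[0] should be 1.0 (the target itself). Smaller values flatten
    the target, so those chains explore and cross between modes more
    easily. Returns the states visited by the beta = betas[0] chain.
    """
    rng = random.Random(seed)
    states = [-4.0] * len(betas)  # every chain starts in the left mode
    samples = []
    for _ in range(n_iter):
        # Within-chain move on the tempered target pi(x) ** beta.
        # Hotter chains take larger steps (an assumed heuristic).
        for k, beta in enumerate(betas):
            proposal = states[k] + rng.gauss(0.0, step / math.sqrt(beta))
            log_ratio = beta * (log_target(proposal) - log_target(states[k]))
            if math.log(1.0 - rng.random()) < log_ratio:
                states[k] = proposal
        # Between-chain move: propose swapping one random adjacent pair.
        if len(betas) > 1:
            k = rng.randrange(len(betas) - 1)
            log_ratio = (betas[k] - betas[k + 1]) * (
                log_target(states[k + 1]) - log_target(states[k])
            )
            if math.log(1.0 - rng.random()) < log_ratio:
                states[k], states[k + 1] = states[k + 1], states[k]
        samples.append(states[0])
    return samples
```

Calling `parallel_tempering(40000, [1.0, 0.5, 0.25, 0.1])` and computing the fraction of returned samples that are positive gives a rough check that both modes are visited. In this sketch the low-beta chains play the exploration role and the beta = 1 chain plays the exploitation role; every chain costs the same here, whereas the abstract describes tuning each chain's behavior and cost to its role.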