Methods for Nonlinear, Non-Gaussian, and Data-Driven Ensemble Data Assimilation in Large-Scale Applications
University Of Colorado At Boulder, Boulder CO
Investigators
Abstract
A wide range of disciplines, from weather to epidemiology to reservoir management, depend on data assimilation, that is, a set of methods that combine incomplete and imperfect observations with a forecasting model to estimate and predict the state of a complex, evolving system. For large-scale systems such as those in weather forecasting, computational efficiency is essential, so the methods used often rely on a Gaussian approximation - assuming that properties are distributed according to a bell curve - because this approximation unlocks highly efficient algorithms. However, in practice many quantities of interest, from sea ice thickness to rain rates, are not described by a bell curve, and predictions can be inaccurate. This project aims to develop new algorithms that are not based on a bell curve approximation but that can still be used in large-scale applications where computational efficiency is crucial. This inherently interdisciplinary project will provide a multitude of opportunities for training and professional development of the next generation of statisticians and data scientists, with a particular focus on enhancing diversity and inclusion. The new insight on which the research project is built is a novel representation of the Bayesian posterior distribution through the introduction of a new synthetic random variable. The Bayesian posterior can be represented as the expected value of the probability density of the state variable conditioned on the new variable, where the expectation is taken with respect to the posterior on the new variable. This insight enables the use of a two-step approach to sampling from the posterior. The first step is standard Bayesian sampling but with a lower dimensionality, while the second step is based on regression. The project will combine methods from low-dimensional Bayesian computation for the first step with generalized linear regression and/or machine-learning regression for the second step. A skeleton of the two-step approach is already implemented in the Data Assimilation Research Testbed (DART) software suite, and DART enables data assimilation with over 25 geoscientifically-relevant models including the National Water Model and the Community Earth System Model. The research findings will be implemented in a form of new advanced two-step non-Gaussian algorithms in DART, making them available to a wide range of users. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
View original record on NSF Award Search →