CI-SUSTAIN: Stan for the Long Run
Columbia University, New York NY
Investigators
Abstract
Stan is a software package that transforms scientific discovery by allowing scientists to quickly and easily explore, evaluate, and refine rich scientific hypotheses tailored to their particular research question and data collection mechanism. For computational reasons, analyses of data (big or otherwise) have tended to be simple and focused more on the difficulties of manipulating the data than on realistic scientific models. The next generation of Bayesian inference can take scientists beyond this impasse, via sophisticated models that can adjust for differences between sample and population, and between treatment and control groups, to join the benefits of large datasets with the rigor and power of statistical adjustment. Stan further helps educate the next generation of data scientists, with a natural, easy-to-learn and portable modeling language, coupled with robust, practical inference tools. The specific goal of this project is to solidify the Stan code base to enable application, maintenance, and development of the Stan software. Stan is being applied in many corners of the physical, biological, and social sciences, hundreds of at scales ranging from the neutrinos to supernovas, from cellular biology to population ecology, and from human reaction times to social network evolution. In this project the PI aims to document and ruggedize the core infrastructure of Stan to enable it to be used by a wider audience of scientists, to be maintained by a wider group of software developers, and to be extensible to allow for the future development of new scientific applications and statistical algorithms. Technically, this project is devoted to thoroughly documenting for users and developers, testing at the unit, integration, and functional levels, and inevitably refactoring the application programming interfaces (API) of Stan's components. These components include (1) an automatically differentiable mathematics, statistics, and matrix algebra library, (2) an imperative probabilistic programming language for expressing statistical data/parameter structures and scientific/measurement models, (3) core inference algorithms for providing exact and approximate full Bayesian parameter estimation and predictive inference, (4) a service layer of high-level commands, I/O callbacks, and interrupts, and (5) interfaces integrating Stan's probability modeling, analysis and visualization capabilities into existing data science workflows in languages including R, Python, Julia, Stata, MATLAB, Mathematica and the command-line for cloud and cluster computing. The main goal of the refactoring is to achieve enough modularity in system components and documentation that developers will be able to concentrate on a single component, such as a new function, algorithm, or visualization, as well as to allow Stan to be used as a research tool for algorithm development. These goals complement the goals of documenting the language and interfaces in order to promote rigorous statistical methodology and reproducible computational workflows for applied scientists.
View original record on NSF Award Search →