RI: III: Medium: Scalable Machine Learning for Automating Scientific Discovery in Astrophysics
Carnegie Mellon University, Pittsburgh PA
Investigators
Abstract
The purpose of this work is to i) develop and validate new, efficient machine learning methods for making inferences and predictions in a massively parallel and distributed way on large-scale, complex data sets from upcoming sky surveys, and ii) help answer fundamental questions in cosmology and astrophysics using those new methods. Theoretical properties of these algorithms will also be investigated. The proposed cosmology and astrophysics applications include a) building a probabilistic model for light intensity signals from stars, b) evolving the matter density of the Universe much faster than traditional N-body simulations, and c) creating "mock catalogs" with all commonly observable galaxy properties. The methods developed will apply far beyond the examples listed here, both to other problems in astrophysics and to problems in completely different domains (e.g., bioinformatics, climatology, social sciences) where complex scientific simulations require large-scale learning methods. The software developed in this work (including documentation, examples, and case studies) will be made publicly available. The PIs will also include the results in their course materials for graduate and undergraduate students.

The aim of this proposal is to develop new machine learning methods that can work directly on large-scale, high-dimensional functions and continuous distributions as inputs or outputs in a regression problem, and that can process large-scale scientific data in a massively parallel, distributed way. Important theoretical properties, such as computational efficiency, sample complexity, generalization accuracy, consistency, and lower and upper bounds on convergence rates, will also be investigated.

Gaussian processes (GPs) are among the most popular nonparametric Bayesian function approximation methods.
However, standard GP methods are limited to at most a few thousand data points and cannot be applied to large datasets. Kernel learning for GPs is an even more challenging problem. The question of how to scale up GP kernel learning methods to large datasets will be addressed as part of this project.

Using the machine learning methods developed in this proposal, the following cosmology and astrophysics problems will be investigated:

a) Scalable Gaussian processes with spectral mixture kernels will be used to build a probabilistic generative model for light intensity signals from stars, in order to extract fundamental properties such as density profiles.

b) New machine learning algorithms will be used to evolve the matter density of the Universe much faster than traditional N-body simulations. This will enable a completely new way of generating a large number of cosmological simulations in order to compare cosmological observations to our understanding of the Universe.

c) Simulated galaxy catalogs are a powerful tool for testing cosmological analysis methods, since the cosmological parameters in the simulation are known and thus our ability to recover them can be tested perfectly. For a single cosmological simulation, the properties and alignments of galaxies are not fully determined, but rather must be added probabilistically to the dark matter distribution as an extra layer of modeling, typically with many parameters. The new machine learning tools developed in this proposal will be used to make "mock catalogs" with all commonly observable galaxy properties.
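
As a rough illustration of the Gaussian process and spectral mixture kernel ideas mentioned above, the following is a minimal sketch of exact GP regression on a one-dimensional signal. It is not the project's method or software. The function names, the synthetic sinusoidal "light curve", and all hyperparameter values are assumptions chosen for illustration, and the kernel hyperparameters are fixed by hand rather than learned. The Cholesky factorization of the n-by-n kernel matrix costs on the order of n^3 operations, which is one common reason exact GP inference becomes impractical beyond a few thousand data points.

```python
import numpy as np


def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """1-D spectral mixture kernel:
    k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q),
    where tau = x1 - x2, mu_q is a frequency and v_q a spectral variance."""
    tau = x1[:, None] - x2[None, :]
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return k


def gp_posterior_mean(x_train, y_train, x_test, noise_var, **kernel_params):
    """Exact GP posterior mean at x_test (zero prior mean, Gaussian noise)."""
    K = spectral_mixture_kernel(x_train, x_train, **kernel_params)
    K[np.diag_indices_from(K)] += noise_var
    # Cholesky factorization of the n x n matrix: O(n^3) time, O(n^2) memory.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_star = spectral_mixture_kernel(x_test, x_train, **kernel_params)
    return K_star @ alpha


# Synthetic periodic signal (hypothetical stand-in for a stellar light curve).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 10.0, 200)
y_train = np.sin(2.0 * np.pi * 0.5 * x_train) + 0.1 * rng.standard_normal(x_train.size)
x_test = np.linspace(1.0, 9.0, 50)

# Hand-picked hyperparameters (assumed): one component centered at frequency 0.5.
params = dict(weights=[1.0], means=[0.5], variances=[0.01])
mean = gp_posterior_mean(x_train, y_train, x_test, noise_var=0.01, **params)
```

In this sketch the single mixture component is centered on the true frequency of the synthetic signal. Kernel learning, which the abstract describes as the harder problem, would instead estimate the weights, frequencies, and variances from the data.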