CAREER: Active Machine Learning for Automating Scientific Discovery

$497,693 | FY2019 | CSE | NSF

Washington University, Saint Louis MO

Investigators

Abstract

It is often much easier to collect and catalog features of data than to analyze data to determine properties of interest. Such settings are pervasive in the natural sciences and engineering, where in-depth investigation can require human intervention, expensive computer simulation, or costly laboratory experiments. Humanity is at the tipping point of a data revolution, and our ability to collect and store information will likely outpace our capacity to extract useful knowledge from data. Active machine learning provides a solution to this dilemma: we adaptively design expensive experiments guided by statistical models of the underlying process to make the most effective use of limited resources. Numerous studies have established active machine learning as a promising tool for automating scientific discovery; however, modern procedures are currently difficult for practitioners to adopt. Considerable expertise is required to use the available tools effectively, especially as the field of machine learning continues to develop rapidly. This project will transform the application of active machine learning to problems from science and engineering, developing novel experimental procedures and pioneering new paradigms of scientific discovery. This project will also dramatically increase the availability of these methods to non-experts through automation, facilitating the further integration of machine learning into practice across disciplines. All research will be motivated by problems and validated on data from applications across science and engineering, including materials science, drug discovery, astronomy, and robotics. The project's research objectives will be accompanied by a comprehensive education plan designed to introduce active machine learning to a broad range of future scientists and engineers.
The project will entail two broad themes of inquiry, corresponding to the two critical components of an active learning pipeline: (1) experimental policies and (2) modeling.
(1) Experimental policies: The core of an active learning procedure is its policy, which decides which data to analyze. A primary challenge when building an active learning system is developing a computationally efficient and empirically effective policy for the given learning objective. This is not a straightforward task: the optimal procedure is computationally infeasible and natural approximations can suffer from myopic, greedy behavior. This project will improve the performance and theoretical understanding of policies for automated scientific discovery, developing and studying both established and novel paradigms for active scientific discovery. A theme throughout this investigation will be nonmyopic decision making, where one reasons about the impact of each decision on the entire learning task. Algorithmic development will be accompanied by extensive theoretical study, establishing fundamental learning bounds and seeking efficient approximation schemes when possible.
(2) Modeling: The second thrust of the investigation will be on modeling complex processes from data, as a policy's success hinges on being guided by an informative model. Model selection for active learning is rendered difficult by inherently limited training data, and accounting for model uncertainty is often critical. The project will investigate automated model selection inline with active learning, advancing the nascent field of automated machine learning to create robust, fully automated active learning systems that do not require expert design or tuning.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
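As an illustration of the two pipeline components named in the abstract (a model and a policy), the following is a minimal sketch of a pool-based active learning loop. It is a toy example and does not represent the project's methods: the `oracle` function, the threshold value 0.62, and all other names are hypothetical. The "model" is simply the interval between the largest known negative and the smallest known positive, and the policy is a myopic one that always queries the middle of the points whose labels are still undetermined.

```python
import random


def oracle(x, threshold=0.62):
    """Stand-in for an expensive experiment (hypothetical): true label of x."""
    return x >= threshold


def uncertainty_policy(pool, labeled):
    """Myopic policy: choose the next point to label, or None if nothing is left.

    The model is the interval (lo, hi) between the largest known negative
    and the smallest known positive. Labels of points outside that interval
    are already implied by the data, so only points inside it are candidates.
    """
    lo = max((x for x, y in labeled.items() if not y), default=float("-inf"))
    hi = min((x for x, y in labeled.items() if y), default=float("inf"))
    candidates = sorted(x for x in pool if lo < x < hi)
    if not candidates:
        return None
    # Querying the median candidate roughly halves the undetermined set.
    return candidates[len(candidates) // 2]


def active_learn(pool, budget):
    """Run the query loop until the budget is spent or no point is uncertain."""
    labeled = {}
    for _ in range(budget):
        x = uncertainty_policy(pool, labeled)
        if x is None:
            break
        labeled[x] = oracle(x)  # the only place the expensive experiment runs
    return labeled


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]
labeled = active_learn(pool, budget=20)
print(len(labeled), "experiments run for a pool of", len(pool))
```

In this toy setting the loop determines the label of every point in the pool after about ten experiments, instead of the 1000 that exhaustive labeling would need. Real discovery problems have noisy, high-dimensional models for which such a simple greedy rule is only a starting point.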
