BIGDATA: F: Scalable and Interpretable Machine Learning: Bridging Mechanistic and Data-Driven Modeling in the Biological Sciences

$900,000 | FY2017 | CSE | NSF

University Of California-Berkeley, Berkeley CA

Abstract

With the rapid advances in information technology, an age of rich data has dawned in nearly every scientific field. Such data hold the potential to guide decision-making and accelerate understanding of complex processes such as human development and disease progression. For instance, massive databases on gene expression and other molecular processes can be used to build models to predict the drivers of a disease. Predictive models are an important step in understanding these complex systems, but equally important is the human interpretability of such models, e.g., to derive mechanistic insights into what factors drive disease onset in order to identify an appropriate course of treatment. Next Generation Sequencing (NGS) technologies have led to a profound shift in how biological data are collected, assaying individual genomic elements that act as part of organized, stereospecific groups to drive emergent biological phenomena. These modern data call for new statistical and data science principles and scalable algorithms to advance the frontier of science. This project focuses on developing novel, scalable statistical machine learning algorithms that are predictive, stable, and interpretable, and that can be used to guide decision-making and discovery in biological systems.

This project aims to build insights into how individual genomic elements act in concert by developing interpretable and stable supervised learning algorithms with state-of-the-art predictive accuracy, along with scalable, open-source software. Many machine learning algorithms with state-of-the-art predictive accuracy are capable of learning complicated rules that might govern complex systems, but these rules are difficult for humans to interpret. The research builds on iterative Random Forests (iRF), an algorithm recently developed by the PIs that recovers the high-order, human-interpretable, Boolean-type interactions that contribute to the state-of-the-art predictive accuracy of Random Forests. The proposed work will develop and validate approaches for refining interactions recovered by iRF to produce testable hypotheses for follow-up studies, along with inference methods to assess the uncertainty associated with these hypotheses. These approaches and methods will be implemented in Apache Spark to ensure scalability to massive datasets in genomics and beyond. Implementation of the methods for large-scale applications will leverage cloud computing resources provided through an agreement between commercial cloud service providers and NSF for the BIGDATA solicitation.
