BIGDATA: F: Towards Automating Data Analysis: Interpretable, Interactive, and Scalable Learning via Discrete Probability

$1,024,363 · FY2017 · CSE · NSF

Massachusetts Institute Of Technology, Cambridge MA

Investigators

Abstract

As machine learning (ML) permeates all areas of science and technology, demands arising from diverse data domains, inference questions, resource limitations, and reliability requirements pose new conceptual and algorithmic challenges. Current shortcomings that limit the full use of machine learning include suboptimal use of data and algorithms; painstaking hand-tuning and model search; difficulties in validating results and in generalization; limited interactivity with humans; difficulty encoding domain knowledge; and lack of interpretability, among others. Progress on these questions has the potential to impact the successful adoption and use of machine learning in a broad range of fields. With this motivation, the goal of this project is to create a novel suite of models and algorithms for analyzing complex datasets, with a particular focus on three factors crucial for next-generation machine learning: (1) interpretability; (2) interactivity; and (3) automated learning. The overarching technical concept underlying this proposal is negative dependence in discrete probability. This project lays theoretical foundations for a new set of tools grounded in this concept. Besides practical impacts, the methods to be studied in the project motivate new theoretical questions and will help increase interest in the underlying mathematics. The practical impact of the proposed work has the potential to benefit society on multiple fronts. Via collaborations, the PIs will evaluate the developed methods in healthcare (seeking to ultimately impact patient care and well-being), systems biology (to help with research on cancer and diabetes, among others), and materials science (to help discover safer, functional materials more efficiently).
The project will also have direct educational impact: training of graduate students, providing material for data science courses at all levels, and outreach to the community via general talks as well as focused lectures at conferences and workshops, including workshops and events targeted at women in Data Science. Technically, the PIs will develop: (1) new tools, models, and algorithms for interactive data analysis, especially for experimental design, information collection, interpretable machine learning, hypothesis testing, performance validation, and architecture learning; (2) theoretical analysis, such as convergence and complexity (statistical and computational); and (3) open-source implementations of all key algorithms and frameworks.
