CRII: RI: Explaining Decisions of Black-box Models via Input Perturbations

$174,942 | FY2018 | CSE | NSF

University of California-Irvine, Irvine, CA

Abstract

Machine learning is at the forefront of many recent advances in science and technology, enabled in part by complex models and algorithms. However, as a consequence of this complexity, machine learning systems essentially act as "black boxes" as far as users are concerned. Thus, it is incredibly difficult to predict what they will do when deployed, understand why they make their decisions, guarantee their robustness, or, broadly speaking, trust their behavior. As these algorithms become an increasing part of our society, our financial systems, our healthcare providers, our scientific advances, and our defense systems, it is crucial to address this challenge. In this work, the PI and his team will develop algorithms that explain why any classifier makes its decisions, without any access to its underlying implementation, in order to make its inner workings understandable to users. Such explanations make machine learning more transparent, leading to a more robust evaluation pipeline, reduced debugging effort, and increased ease of use of (and trust in) these complex, black-box systems. For a decision made by a machine learning classifier, the team will develop methods that accurately characterize the relationship between the input instance and the algorithm's prediction, and present it in an intuitive manner. The primary intuition is to estimate the instance-specific behavior of the predictor by observing the output of the classifier as the input instance is perturbed. The first proposed thrust of this work extends this basic framework by considering rules that define counter-examples and that summarize the behavior over multiple instances, providing detailed and accurate insights into the behavior with minimal effort on the users' part. The second thrust identifies automated ways to learn domain-specific perturbation functions that generate realistic instances from which to compute the explanations.

The team proposes a comprehensive evaluation of these explainers consisting of user experiments in comparing, trusting, and modifying machine learning algorithms, with applications to diverse tasks such as sentiment analysis, machine translation, time series, visual question answering, and object detection. Due to the many potential applications of this work, both for machine learning practitioners and end-users, dissemination of the results is a key focus, and the team will augment standard channels (such as publications) with novel ones that include open-source software, jargon-free documentation, and interactive tutorials/demonstrations to encourage application of machine learning to novel domains. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
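
As a rough illustration of the perturbation idea described in the abstract (observe a classifier's output while the input instance is perturbed), the Python sketch below scores each word of a sentence by how much the prediction changes when that word is randomly dropped. It is not the method developed under this award. The names `toy_sentiment_classifier` and `explain_by_perturbation`, the random word-dropping perturbation, and the difference-of-means score are all assumptions chosen to keep the example small.

```python
import random
from typing import Callable, List, Sequence, Tuple


def toy_sentiment_classifier(tokens: Sequence[str]) -> float:
    """Hypothetical black box: returns a positive-sentiment score in [0, 1]."""
    score = 0.5
    if "great" in tokens:
        score += 0.4
    if "boring" in tokens:
        score -= 0.4
    return min(1.0, max(0.0, score))


def explain_by_perturbation(
    predict: Callable[[Sequence[str]], float],
    tokens: List[str],
    num_samples: int = 500,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Score each token by (mean output when kept) - (mean output when dropped).

    The model is only queried through `predict`; its internals are never read.
    """
    rng = random.Random(seed)
    n = len(tokens)
    kept_sum, dropped_sum = [0.0] * n, [0.0] * n
    kept_cnt, dropped_cnt = [0] * n, [0] * n

    for _ in range(num_samples):
        # Assumed perturbation: drop each token independently with probability 0.5.
        mask = [rng.random() < 0.5 for _ in range(n)]
        perturbed = [t for t, keep in zip(tokens, mask) if keep]
        output = predict(perturbed)
        for i, keep in enumerate(mask):
            if keep:
                kept_sum[i] += output
                kept_cnt[i] += 1
            else:
                dropped_sum[i] += output
                dropped_cnt[i] += 1

    scores = []
    for i, token in enumerate(tokens):
        kept_mean = kept_sum[i] / max(kept_cnt[i], 1)
        dropped_mean = dropped_sum[i] / max(dropped_cnt[i], 1)
        scores.append((token, kept_mean - dropped_mean))
    return scores


if __name__ == "__main__":
    sentence = "the plot was great but the ending felt boring".split()
    for token, score in explain_by_perturbation(toy_sentiment_classifier, sentence):
        print(f"{token:>8s}  {score:+.2f}")
```

In this toy setting, words the classifier relies on should receive scores far from zero and the remaining words scores near zero. Randomly dropping words can produce unnatural inputs, which is the kind of limitation that the second thrust's learned, domain-specific perturbation functions are meant to address.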
