Collaborative Research: RI: Small: Post hoc Explanations in the Wild: Exposing Vulnerabilities and Ensuring Robustness

$225,000FY2020CSENSF

Harvard University, Cambridge MA

Investigators

Abstract

The successful adoption of machine learning (ML) models in critical domains such as healthcare and criminal justice relies heavily on how well decision makers are able to understand and trust the functionality of these models. However, the proprietary nature and increasing complexity of ML models makes it challenging for domain experts to understand these complex "black boxes". Consequently, there has been a recent surge in techniques that explain black box models in a human interpretable manner by approximating them using simpler models. However, it is unclear to what extent these post hoc explanation techniques may mislead end users by giving them a false sense of security, and luring them into trusting and deploying untrustworthy black boxes. This project will build rigorous frameworks to expose the vulnerabilities of existing explanation techniques, assess how these vulnerabilities can manifest in real world applications, and develop new techniques to defend against these vulnerabilities. This project has the potential to significantly speed up the adoption of ML in a variety of domains including criminal justice (e.g., bail decisions), health care (e.g., patient diagnosis and treatment), and financial lending (e.g., loan approval). The goal of this project is to characterize the vulnerabilities of existing explanation techniques, understand how adversaries can exploit these vulnerabilities, and develop techniques to defend against them. The project will focus on the following subtasks: 1) understanding the real-world consequences of misleading explanations by conducting user studies and detailed interviews with domain experts in healthcare and criminal justice 2) identifying critical vulnerabilities in state-of-the-art explanation techniques that can be exploited by adversarial entities to generate misleading explanations, and 3) developing novel techniques for building robust and reliable explanations that are not prone to these vulnerabilities and thereby provide domain experts and other stakeholders with faithful explanations of complex black box models. With these contributions, the project will initiate a new body of research in ML interpretability that focuses on understanding how adversaries can manipulate explanation techniques, and how to defend against such attacks. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →