III: Small: Foundations of Trustworthy Deep Learning: Interpretable Neural Network models with Robustness Guarantees
University of California-San Diego, La Jolla, CA
Investigators
Abstract
Deep neural networks have achieved remarkable success in fields ranging from computer vision to healthcare and autonomous driving. However, their susceptibility to various failure modes and blind spots can pose significant risks, especially in safety-critical and high-stakes applications. Given their complex and opaque “black-box” nature, understanding why and when deep neural networks fail is crucial for their safe real-world deployment. Existing mainstream interpretability methods for neural networks are limited: they focus mainly on subjective explanations based on influential features, fail to scale or to explain internal network processes, and are not robust to minor input changes, which can be risky in high-stakes applications. This project aims to develop an automated framework for interpreting neural networks and to design robust neural network models based on human-understandable concepts. These advancements will promote automation and scalability, ensure transparent decision-making, facilitate efficient model debugging, and enable timely intervention, leading to safer, more reliable, and more widely trusted applications of deep learning technology in critical domains.
This project will develop methods that can be used to ensure modern deep neural network models are interpretable and trustworthy. It includes methods for (1) automating interpretations that describe the internal functioning of a deep neural network via human-understandable concepts, without the need to collect curated and expert annotations; (2) learning intrinsically interpretable models that contain task-relevant, human-interpretable concepts by design; and (3) quantifying and ensuring the robustness and reliability of the generated interpretations and the neural network models. The methods will draw on the investigator’s expertise in trustworthy machine learning and neural network robustness verification techniques to develop scalable and automated methods that will promote interpretability, robustness, transparency, and reliability in deep learning. If successful, this project will guide the design of deep learning systems to guarantee transparency and robustness, and provide the tools needed to enforce these properties during model development and deployment.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.