SHF: Small: Differential Testing for Machine Learning Software
Purdue University, West Lafayette IN
Investigators
Abstract
Machine-learning systems, including deep-learning (DL) systems, demand reliability. DL systems consist of two key components: (1) models and algorithms that perform complex mathematical calculations, and (2) software that implements the algorithms and models. Here, software includes both DL infrastructure code (e.g., code that performs core neural-network computations) and application code (e.g., code that loads model weights). Thus, for the entire DL system to be reliable, both the software implementation and the models/algorithms must be reliable. If the software fails to faithfully implement a model (e.g., due to a bug in the software), its output can be wrong even if the model is correct, and vice versa. This project raises awareness of the need to test DL software implementations in order to find and localize defects. Specifically, this project develops end-to-end solutions to improve the reliability and robustness of DL systems by differentially testing DL software (including both source code and data) to detect and localize bugs and to defend against adversarial input. In addition to advancing the state of the art, the findings, approaches, and tools developed in the project should provide educational and practical resources for testing DL software, and could help transform how students, developers, and researchers test DL software.
Testing DL software is challenging for two reasons. First, it is particularly difficult for developers to know the expected output of the software under test for a given input instance, because DL algorithms and models use complex networks and mathematical formulas. Second, once a bug is detected, the faulty functions must be identified among the many functions in the DL software. To address these challenges, this project uses differential testing to detect and localize bugs in DL software, including code and data, without relying on the expected output.
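As a rough illustration of the idea (not the project's actual tooling), the sketch below runs two independently written implementations of the same tiny computation on the same inputs. A disagreement in the final outputs signals a bug without any expected output, and comparing the intermediate outputs stage by stage points to the first stage that diverges. All names here (IMPL_A, IMPL_B, differential_test, the seeded bug in relu_buggy, and the tolerance value) are hypothetical and chosen only for this example.

```python
import math

# Two hypothetical implementations of the same pipeline: scale -> ReLU -> softmax.
# IMPL_B contains a deliberately seeded bug in its ReLU stage.

def scale(xs):
    return [2.0 * x for x in xs]

def relu_ok(xs):
    return [max(0.0, x) for x in xs]

def relu_buggy(xs):
    # Seeded bug (assumption for the example): wrong for negative inputs.
    return [abs(x) for x in xs]

def softmax_stable(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_naive(xs):
    # Mathematically equivalent to softmax_stable; differs only by rounding.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

IMPL_A = [("scale", scale), ("relu", relu_ok), ("softmax", softmax_stable)]
IMPL_B = [("scale", scale), ("relu", relu_buggy), ("softmax", softmax_naive)]

def run(pipeline, xs):
    """Return a list of (stage_name, output) pairs, one per stage."""
    trace = []
    for name, fn in pipeline:
        xs = fn(xs)
        trace.append((name, xs))
    return trace

def differs(a, b, tol):
    """True if any pair of corresponding values differs beyond the tolerance."""
    return any(not math.isclose(x, y, rel_tol=tol, abs_tol=tol)
               for x, y in zip(a, b))

def differential_test(impl_a, impl_b, inputs, tol=1e-9):
    """For each input whose final outputs disagree, report the input and
    the name of the first stage whose outputs diverge."""
    findings = []
    for xs in inputs:
        trace_a, trace_b = run(impl_a, xs), run(impl_b, xs)
        if differs(trace_a[-1][1], trace_b[-1][1], tol):
            stage = next(name
                         for (name, out_a), (_, out_b) in zip(trace_a, trace_b)
                         if differs(out_a, out_b, tol))
            findings.append((xs, stage))
    return findings

inputs = [[1.0, 2.0, 3.0], [-1.0, 0.5, 2.0]]
findings = differential_test(IMPL_A, IMPL_B, inputs)
print(findings)
```

The tolerance matters in this sketch: the two softmax variants differ only by floating-point rounding, so a strict equality check would flag them as a false alarm, whereas the seeded ReLU bug produces a difference far larger than the tolerance and is reported only for the input that contains a negative value.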
To achieve this goal, one must understand the variance across multiple training runs with identical configurations, e.g., identical DL software, algorithm, training and test data, and network. Such variance indicates nondeterminism in the DL algorithms and software implementations, which presents both opportunities and challenges for researchers and practitioners. Building on the variance results, the first thrust creates differential-testing approaches to detect software bugs in both the DL training and inference phases, including when multiple implementations are unavailable, e.g., by mutating models. This thrust addresses the fundamental oracle challenge of testing DL software. The second thrust builds approaches that isolate the differences among multiple runs to localize bugs, helping developers identify root causes so that they can fix bugs correctly and more quickly. The last thrust tests the data of DL software to identify and defend against adversarial input. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.