SaTC: CORE: Small: Systematic Threat Characterization and Prevention in Open-Domain Dialog Systems
Virginia Polytechnic Institute And State University, Blacksburg VA
Investigators
Abstract
Dialog systems, or chatbots, powered by deep neural networks are increasingly being deployed at scale without an understanding of the vulnerabilities that affect them. Using specially designed learning algorithms, these chatbots are trained on existing human-human conversation data to produce convincing conversations on a variety of topics. However, biases in the training data, including intentionally injected ones, can make these systems ripe for abuse by malicious actors who aim to trigger toxic or harmful conversations. This may expose vulnerable users to harm, given the lack of attention to security in existing deployments and the fact that these systems are used in sensitive domains such as healthcare, emotional support, and the U.S. justice system. This project will systematically characterize a variety of threats impacting chatbot systems, then build novel deployable defenses to measure toxicity, uncover hidden vulnerabilities, detoxify impacted systems, and enable attack-resilient training pipelines. The project will also create partnerships between multiple computer science disciplines and between industry and academia to raise awareness of and defend against these threats. The project provides unique opportunities for underrepresented K-12 students to study emerging topics in machine learning and security, aiming to attract them toward STEM careers.
This project has three research thrusts. The first is a large-scale measurement study that uses widely used chatbot pipelines to characterize their vulnerability to unintentionally and intentionally injected toxicity. Toxicity injection attacks are characterized using a novel, fully automated pipeline that leverages large language models with minimal human supervision, allowing the methods to scale. The second thrust is developing a novel generative modeling approach to probe chatbots for hidden toxicity vulnerabilities, to detoxify models, and to create safety benchmarks. The third thrust builds on the earlier findings to develop a novel attack-agnostic training pipeline that is resilient to toxicity injection attacks.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.