Defending Against Unforeseen Failure Modes with Latent Adversarial Training


22 Aug 2024 | Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
This paper introduces latent adversarial training (LAT) as a method to defend against unforeseen failure modes in AI systems without requiring knowledge of the specific vulnerabilities or the inputs that trigger them. Unlike traditional adversarial training (AT), which applies perturbations to input data, LAT applies perturbations to the latent representations the model uses for prediction. This approach leverages the compressed, abstract, and structured representations that models develop internally, making it possible to activate the neural circuitry underlying a failure even when no triggering input is available. The authors demonstrate that LAT can improve robustness to novel attacks and performance on clean data in image classification, text classification, and text generation tasks. They also show that LAT can help defend against trojans and other types of adversarial attacks. However, they caution that robustness techniques can sometimes harm robustness to novel failure modes, and that poorly configured AT and LAT can entrench trojans in models. The study suggests that LAT may be a promising tool for defending against failure modes that developers have not explicitly identified.
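To make the core idea concrete, the sketch below shows one way a LAT training step could look in PyTorch: a projected-gradient attack is run in the model's latent space rather than on the inputs, and the model is then trained against those perturbed latents. The split of the network into `encoder` and `head`, the function names, and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal LAT sketch (assumed setup): a model split into an "encoder"
# producing the latent representation and a "head" mapping latents to logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

def lat_step(encoder, head, x, y, eps=0.1, alpha=0.02, pgd_steps=5):
    """One training step: attack the latent activations, then train on them."""
    with torch.no_grad():
        latents = encoder(x)                      # clean latent representations

    delta = torch.zeros_like(latents, requires_grad=True)
    for _ in range(pgd_steps):                    # PGD in latent space
        loss = F.cross_entropy(head(latents + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()          # ascend the loss
            delta.clamp_(-eps, eps)               # stay in an L_inf ball around the latents

    # Train the whole model against the perturbed latents.
    adv_loss = F.cross_entropy(head(encoder(x) + delta.detach()), y)
    return adv_loss

# Toy usage: a small classifier split into encoder and head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())
head = nn.Linear(128, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
loss = lat_step(encoder, head, x, y)
opt.zero_grad(); loss.backward(); opt.step()
```

The only structural difference from standard AT is where the perturbation lives: the PGD loop optimizes a perturbation added to the latent activations instead of to the input pixels or tokens, which is what lets the attack reach failure-inducing internal states without an explicit triggering input.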