Defending Against Unforeseen Failure Modes with Latent Adversarial Training


22 Aug 2024 | Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
This paper introduces latent adversarial training (LAT) as a method to defend against unforeseen failure modes in AI systems without requiring knowledge of the specific vulnerabilities or the inputs that trigger them. Unlike traditional adversarial training (AT), which applies perturbations to input data, LAT applies perturbations to the latent representations the model uses for prediction. This approach leverages the compressed, abstract, and structured representations that models develop internally, making it possible to activate the neural circuitry underlying a failure even when no triggering input is available. The authors demonstrate that LAT can improve robustness to novel attacks and performance on clean data in image classification, text classification, and text generation tasks. They also show that LAT can help defend against trojans and other types of adversarial attacks. However, they caution that robustness techniques can sometimes harm robustness to novel failure modes, and that poorly configured AT and LAT can entrench trojans in models. The study suggests that LAT may be a promising tool for defending against failure modes that developers have not explicitly identified.
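To make the core idea concrete, the sketch below shows one way a LAT training step could look in PyTorch: a projected-gradient attack is run in the model's latent space rather than on the inputs, and the model is then trained against those perturbed latents. The split of the network into `encoder` and `head`, the function names, and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal LAT sketch (assumed setup): a model split into an "encoder"
# producing the latent representation and a "head" mapping latents to logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

def lat_step(encoder, head, x, y, eps=0.1, alpha=0.02, pgd_steps=5):
    """One training step: attack the latent activations, then train on them."""
    with torch.no_grad():
        latents = encoder(x)                      # clean latent representations

    delta = torch.zeros_like(latents, requires_grad=True)
    for _ in range(pgd_steps):                    # PGD in latent space
        loss = F.cross_entropy(head(latents + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()          # ascend the loss
            delta.clamp_(-eps, eps)               # stay in an L_inf ball around the latents

    # Train the whole model against the perturbed latents.
    adv_loss = F.cross_entropy(head(encoder(x) + delta.detach()), y)
    return adv_loss

# Toy usage: a small classifier split into encoder and head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())
head = nn.Linear(128, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
loss = lat_step(encoder, head, x, y)
opt.zero_grad(); loss.backward(); opt.step()
```

The only structural difference from standard AT is where the perturbation lives: the PGD loop optimizes a perturbation added to the latent activations instead of to the input pixels or tokens, which is what lets the attack reach failure-inducing internal states without an explicit triggering input.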