21 Aug 2024 | Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Slight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Latent Adversarial Training (LAT) improves robustness to persistent harmful behaviors in large language models (LLMs). This paper introduces targeted LAT, which differs from untargeted LAT by steering the adversarial perturbation toward specific harmful behaviors rather than simply maximizing the training loss. Targeted LAT is applied to improve robustness to jailbreaks, to remove backdoors without knowledge of the trigger, and to strengthen unlearning of undesirable knowledge. The method adversarially perturbs latent activations at an intermediate layer to elicit a specific harmful behavior, then fine-tunes the model to behave safely under those perturbations.
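As a rough illustration, below is a minimal PyTorch sketch of one targeted-LAT step, assuming a HuggingFace-style causal LM with a LLaMA-like module tree (model.model.layers) whose decoder layers return their hidden states first in a tuple. Helper names such as targeted_lat_step, run, and completion_ce are ours for illustration and are not taken from the paper's code.

```python
import torch
import torch.nn.functional as F


def targeted_lat_step(model, prompt_ids, harmful_ids, safe_ids,
                      layer=8, eps=8.0, inner_steps=6, inner_lr=1e-2):
    """One targeted-LAT step (illustrative sketch, not the authors' code).

    Inner loop: find an L2-bounded perturbation to the residual stream over the
    prompt tokens that steers the model toward the harmful target completion.
    Outer loss: cross-entropy on the safe completion *under* that perturbation;
    the caller backpropagates it and steps the model's optimizer.
    """
    device = next(model.parameters()).device
    prompt_len = prompt_ids.shape[1]
    hidden_size = model.config.hidden_size

    def run(input_ids, delta):
        # Add `delta` to the chosen layer's hidden states at the prompt positions.
        def hook(_module, _inputs, output):
            h = output[0]
            h = torch.cat([h[:, :prompt_len] + delta, h[:, prompt_len:]], dim=1)
            return (h,) + tuple(output[1:])
        # Assumes a LLaMA-style module tree (model.model.layers[i]).
        handle = model.model.layers[layer].register_forward_hook(hook)
        try:
            return model(input_ids.to(device)).logits
        finally:
            handle.remove()

    def completion_ce(logits, input_ids, completion_len):
        # Cross-entropy on the completion tokens only (teacher forcing).
        preds = logits[:, -completion_len - 1:-1]
        labels = input_ids[:, -completion_len:].to(device)
        return F.cross_entropy(preds.transpose(1, 2), labels)

    harmful_input = torch.cat([prompt_ids, harmful_ids], dim=1)
    safe_input = torch.cat([prompt_ids, safe_ids], dim=1)
    delta = torch.zeros(1, prompt_len, hidden_size,
                        device=device, requires_grad=True)

    # Inner maximization: make the harmful target completion more likely.
    for _ in range(inner_steps):
        adv_loss = completion_ce(run(harmful_input, delta),
                                 harmful_input, harmful_ids.shape[1])
        (grad,) = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta -= inner_lr * grad          # descend: raise p(harmful target)
            norm = delta.norm()
            if norm > eps:                    # project back onto the L2 ball
                delta *= eps / norm

    # Outer objective: behave safely even under the adversarial perturbation.
    defended_logits = run(safe_input, delta.detach())
    return completion_ce(defended_logits, safe_input, safe_ids.shape[1])
```

In a full training loop the returned loss would typically be mixed with benign fine-tuning data (and, in the paper, an "away" term penalizing the harmful completion) before stepping the optimizer. The key design choice is that the inner adversary optimizes a norm-bounded latent perturbation toward a specific harmful completion, rather than maximizing the training loss as in untargeted LAT.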
The results show that targeted LAT significantly improves robustness to jailbreaking attacks, outperforming existing methods such as R2D2 at a fraction of the computational cost. It also improves backdoor removal without knowledge of the trigger, and makes unlearning of harmful knowledge more effective and more robust than previous methods.
The paper demonstrates that targeted LAT can be combined with, and improve upon, a wide range of state-of-the-art techniques. It improves the robustness of LLMs to a variety of failure modes, including jailbreaks, backdoor attacks, and the retention of undesirable knowledge. It also reduces the sample efficiency with which previously unlearned knowledge can be re-learned.
The study highlights the importance of addressing persistent harmful behaviors in LLMs and shows that targeted LAT is a practical and effective tool for improving the safety and security of these models. The results suggest that targeted LAT can be a valuable addition to existing methods for making LLMs more robust to harmful behaviors.