Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

21 Aug 2024 | Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
The paper "Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs" by Abhay Sheshadri explores the challenges of making large language models (LLMs) more robust to undesirable behaviors that they are explicitly trained to avoid. The authors introduce targeted latent adversarial training (LAT) as a method to enhance the robustness of LLMs against persistent harmful behaviors, such as jailbreaking, backdoor removal, and unlearning of undesirable knowledge. **Key Contributions:** 1. **Proposed Method:** Targeted LAT is introduced to improve the robustness of LLMs to specific types of harmful behaviors. 2. **Experimental Results:** The paper demonstrates that targeted LAT can significantly enhance the effectiveness of existing fine-tuning and adversarial training methods in various scenarios, including: - **Jailbreaking:** Targeted LAT outperforms strong baselines like R2D2 with significantly less computational resources. - **Backdoor Removal:** It improves the ability to remove backdoors without significant side effects, even when the trigger is unknown. - **Unlearning:** It enhances the unlearning of undesirable knowledge, such as Harry Potter and potentially harmful biology and cyber knowledge, with minimal impact on general performance. **Methodology:** - **Latent Adversarial Training (LAT):** LAT involves perturbing the latent activations of the model to elicit specific failure modes and then fine-tuning the model to resist these perturbations. - **Targeted LAT:** Unlike untargeted LAT, which aims to maximize loss on desirable behavior, targeted LAT seeks to minimize loss on a specific competing task, making it more effective for removing specific undesirable behaviors. **Experiments:** - **Jailbreaking:** Targeted LAT significantly improves the robustness of LLMs to jailbreaking attacks, outperforming R2D2 with orders of magnitude less compute. - **Backdoor Removal:** Targeted LAT enhances the effectiveness of DPO (Direct Preference Optimization) in removing backdoors, even when the trigger is unknown. - **Unlearning:** Targeted LAT improves the unlearning of undesirable knowledge, such as Harry Potter and potentially harmful biology and cyber knowledge, with minimal impact on general performance. **Discussion:** - **Practical Value:** Targeted LAT is a valuable tool for improving the safety and security of LLMs. - **Challenges:** The technique is challenging to configure and tune, and it is limited to models with fewer than 10 billion parameters. - **Future Work:** The authors suggest further research on improving latent-space attacks, augmenting other latent-space manipulation techniques, and developing generalized adversarial attacks for LLM evaluations. Overall, the paper provides a comprehensive approach to enhancing the robustness of LLMs against persistent harmful behaviors, demonstrating the effectiveness of targeted LAT in various practical scenarios.The paper "Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs" by Abhay Sheshadri explores the challenges of making large language models (LLMs) more robust to undesirable behaviors that they are explicitly trained to avoid. The authors introduce targeted latent adversarial training (LAT) as a method to enhance the robustness of LLMs against persistent harmful behaviors, such as jailbreaking, backdoor removal, and unlearning of undesirable knowledge. **Key Contributions:** 1. 
**Experiments:**

- **Jailbreaking:** Targeted LAT significantly improves the robustness of LLMs to jailbreaking attacks, outperforming R2D2 with orders of magnitude less compute.
- **Backdoor Removal:** Targeted LAT enhances the effectiveness of DPO (Direct Preference Optimization) at removing backdoors, even when the trigger is unknown.
- **Unlearning:** Targeted LAT improves the unlearning of undesirable knowledge, such as knowledge of Harry Potter and potentially harmful biology and cyber knowledge, with minimal impact on general performance.

**Discussion:**

- **Practical Value:** Targeted LAT is a practical tool for improving the safety and security of LLMs.
- **Challenges:** The technique can be difficult to configure and tune, and the paper's experiments are limited to models with fewer than 10 billion parameters.
- **Future Work:** The authors suggest further research on stronger latent-space attacks, on combining LAT with other latent-space manipulation techniques, and on generalized adversarial attacks for LLM evaluations.

Overall, the paper offers a practical approach to hardening LLMs against persistent harmful behaviors, demonstrating the effectiveness of targeted LAT across several realistic scenarios.
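For readers who want the training loop in code, below is a minimal PyTorch sketch of one targeted LAT step. It uses a toy model as a stand-in for an LLM, and all names (`ToyLM`, `targeted_lat_step`, the point where `delta` is injected) are illustrative assumptions, not the paper's implementation; details such as layer choice, norm constraints, and auxiliary losses follow the paper itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Toy stand-in for an LLM; in practice the perturbation would be
    added to a transformer's residual stream at a chosen layer."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layer1 = nn.Linear(d_model, d_model)
        self.layer2 = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids, delta=None):
        h = torch.relu(self.layer1(self.embed(ids)))
        if delta is not None:  # latent-space perturbation injected here
            h = h + delta
        return self.head(torch.relu(self.layer2(h)))

def targeted_lat_step(model, opt, x, y_harmful, y_safe,
                      eps=1.0, inner_steps=8, inner_lr=0.1):
    """One targeted LAT step: the inner loop finds a latent perturbation
    that elicits the competing (harmful) completion; the outer step trains
    the model to produce the safe completion under that perturbation."""
    # Inner loop (adversary): choose delta to MINIMIZE loss on y_harmful.
    with torch.no_grad():
        h_shape = model.embed(x).shape
    delta = torch.zeros(h_shape, requires_grad=True)
    for _ in range(inner_steps):
        logits = model(x, delta)
        adv_loss = F.cross_entropy(logits.flatten(0, 1), y_harmful.flatten())
        (grad,) = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta -= inner_lr * grad                     # descend on harmful loss
            scale = (eps / delta.norm().clamp(min=1e-8)).clamp(max=1.0)
            delta *= scale                               # project onto the L2 ball
    # Outer step (defender): train toward the safe target under delta.
    opt.zero_grad()
    logits = model(x, delta.detach())
    loss = F.cross_entropy(logits.flatten(0, 1), y_safe.flatten())
    loss.backward()
    opt.step()
    return loss.item()

# Usage with random data, purely to show the shapes involved:
model = ToyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randint(0, 100, (4, 16))           # prompt tokens
y_harmful = torch.randint(0, 100, (4, 16))   # competing (unwanted) targets
y_safe = torch.randint(0, 100, (4, 16))      # desired (e.g., refusal) targets
print(targeted_lat_step(model, opt, x, y_harmful, y_safe))
```

The key design point the sketch illustrates is that the adversary operates on activations rather than input tokens, so a single gradient-based inner loop can elicit behaviors that would be hard to trigger through discrete prompt search.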