Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

14 Jun 2024 | Wei Zhao*1, Zhe Li*1, Yige Li1, Ye Zhang2, Jun Sun1,
This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks, which are adversarial prompts designed to elicit unintended and unsafe responses. Despite the impressive performance of LLMs, they are susceptible to such attacks, even when aligned through Reinforcement Learning from Human Feedback or supervised fine-tuning. Existing defense methods focus on detecting harmful prompts or reducing the likelihood of harmful responses, but they do not explore the inner mechanisms of LLMs to enhance their resilience against jailbreak attacks. The authors propose a novel defense method called Layer-specific Editing (LED), which enhances the safety alignment of LLMs by realigning critical safety layers and additional layers with decoded safe responses from identified toxic layers. Through extensive experiments on various LLMs (e.g., Llama2, Mistral), LED demonstrates effectiveness in defending against jailbreak attacks while maintaining performance on benign prompts. The key contributions include: 1. **Finding Critical Safety Layers**: LED identifies early layers as crucial for handling harmful prompts, as removing these layers significantly increases the attack success rate. 2. **Observing Toxic Layers**: While jailbreak prompts cause LLMs to generate harmful responses, not all layers are successfully attacked. Some layers show a high probability of decoding refusal tokens, indicating that jailbreak attacks may alter the final response rather than intermediate outputs. 3. **Proposing LED**: LED uses targeted model editing to enhance LLM defense against adversarial attacks by realigning safety layers and additional layers with decoded safe responses from toxic layers. The paper also includes a detailed analysis of the effectiveness of LED compared to other defense methods, showing that LED consistently outperforms them in reducing attack success rates and preserving helpfulness on benign benchmarks. The authors conclude by discussing the limitations and future directions, emphasizing the need for further research to refine defense mechanisms and broaden their applicability.This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks, which are adversarial prompts designed to elicit unintended and unsafe responses. Despite the impressive performance of LLMs, they are susceptible to such attacks, even when aligned through Reinforcement Learning from Human Feedback or supervised fine-tuning. Existing defense methods focus on detecting harmful prompts or reducing the likelihood of harmful responses, but they do not explore the inner mechanisms of LLMs to enhance their resilience against jailbreak attacks. The authors propose a novel defense method called Layer-specific Editing (LED), which enhances the safety alignment of LLMs by realigning critical safety layers and additional layers with decoded safe responses from identified toxic layers. Through extensive experiments on various LLMs (e.g., Llama2, Mistral), LED demonstrates effectiveness in defending against jailbreak attacks while maintaining performance on benign prompts. The key contributions include: 1. **Finding Critical Safety Layers**: LED identifies early layers as crucial for handling harmful prompts, as removing these layers significantly increases the attack success rate. 2. **Observing Toxic Layers**: While jailbreak prompts cause LLMs to generate harmful responses, not all layers are successfully attacked. Some layers show a high probability of decoding refusal tokens, indicating that jailbreak attacks may alter the final response rather than intermediate outputs. 3. **Proposing LED**: LED uses targeted model editing to enhance LLM defense against adversarial attacks by realigning safety layers and additional layers with decoded safe responses from toxic layers. The paper also includes a detailed analysis of the effectiveness of LED compared to other defense methods, showing that LED consistently outperforms them in reducing attack success rates and preserving helpfulness on benign benchmarks. The authors conclude by discussing the limitations and future directions, emphasizing the need for further research to refine defense mechanisms and broaden their applicability.
Reach us at info@study.space