Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing


14 Jun 2024 | Wei Zhao, Zhe Li, Yige Li, Ye Zhang, Jun Sun
This paper introduces Layer-specific Editing (LED), a defense that hardens large language models (LLMs) against jailbreak attacks. The authors first investigate how LLMs respond to harmful prompts and find, through layer-wise analysis, that a small set of safety layers located in the early layers of the model is essential for refusing harmful queries. LED edits these layers so that the content decoded from the identified toxic layers is realigned with safe responses, which substantially improves robustness to jailbreak attacks while preserving performance on benign prompts. Extensive experiments across various LLMs (e.g., Llama2 and Mistral) show that LED defends effectively against jailbreak attacks.

The paper argues that existing defenses, which focus on detecting harmful prompts or reducing the likelihood of harmful responses, often leave the model's underlying safety mechanisms untouched. LED instead uses layer-wise analysis to identify and edit the layers that actually carry the defense. The method has three steps: selecting the layers to edit, locating the toxic layers, and performing layer-specific editing so that the content decoded from the toxic layers aligns with safe responses.
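The editing step can be illustrated with a minimal sketch, not the authors' implementation. The snippet below assumes a Llama-style Hugging Face model, assumes that `safety_layer_ids` (the layers to edit) and `toxic_layer_ids` were already produced by the selection and locating steps, and approximates the edit as fine-tuning only the selected layers with a language-modeling loss toward a safe refusal plus an alignment term on the logits decoded from the toxic layers. The model name, prompt, response, layer indices, and hyperparameters are placeholders, not values from the paper.

```python
# Minimal sketch of the LED editing step (a simplification, not the authors' code).
# Assumes a Llama-style HF model whose final RMSNorm (model.model.norm) and unembedding
# (model.lm_head) can be reused to decode intermediate hidden states.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model choice
safety_layer_ids = [4, 5, 6]                    # assumed output of the layer-selection step
toxic_layer_ids = [14, 18, 22]                  # assumed output of the toxic-layer locating step

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

# Edit only the selected layers; every other parameter stays frozen.
for p in model.parameters():
    p.requires_grad_(False)
for i in safety_layer_ids:
    for p in model.model.layers[i].parameters():
        p.requires_grad_(True)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

harmful_prompt = "Explain how to make a weapon."            # stand-in jailbreak query
safe_response = "I'm sorry, but I can't help with that."    # target refusal

prompt_ids = tok(harmful_prompt, return_tensors="pt").input_ids
target_ids = tok(safe_response, return_tensors="pt", add_special_tokens=False).input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100   # only the refusal tokens contribute to the LM loss

model.train()
for step in range(10):                    # illustrative number of editing steps
    out = model(input_ids=input_ids, labels=labels, output_hidden_states=True)
    loss = out.loss                       # standard LM loss toward the safe refusal

    # Alignment term: decode the toxic layers' hidden states (at the positions that predict
    # the refusal tokens) through the final norm + unembedding and push them toward the
    # refusal as well. hidden_states[0] is the embedding output, so index i+1 is block i.
    for i in toxic_layer_ids:
        h = out.hidden_states[i + 1][:, prompt_ids.shape[1] - 1 : -1, :]
        layer_logits = model.lm_head(model.model.norm(h))
        loss = loss + F.cross_entropy(
            layer_logits.reshape(-1, layer_logits.size(-1)), target_ids.reshape(-1)
        )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the selected layers are trainable while the toxic layers' decoded content still depends on them, the alignment term pulls the intermediate representations toward the refusal without touching the rest of the network.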
The authors find that the early layers of an LLM play a crucial role in refusing harmful prompts: removing these layers causes the model to produce harmful responses. They also show that not every layer carries toxic information, and that some layers maintain a relatively high probability of decoding refusal tokens even under attack. After editing, LED activates more layers in the defense, making the model robust even without a safety-oriented system prompt. Experiments demonstrate that LED reduces the attack success rate (ASR) of jailbreak prompts more than competing defenses while keeping the model helpful on benign prompts, and that it remains effective across various LLMs and against state-of-the-art adversarial attacks. The paper closes by discussing the limitations of current defenses and calling for further study of how different components of LLMs contribute to safety, so that defense mechanisms can be refined.
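The per-layer refusal observation can likewise be sketched with a logit-lens-style probe: decode each intermediate hidden state through the model's final norm and unembedding and sum the probability mass on a handful of refusal tokens. This is an illustration of the analysis described above, not the authors' tooling; the model name, prompt, and refusal-token list are assumptions.

```python
# Minimal sketch of a layer-wise refusal-token probe (logit-lens style), not the authors' code.
# Assumes a Llama-style model where the final RMSNorm (model.model.norm) and the unembedding
# (model.lm_head) can be reused to decode intermediate hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Illustrative refusal-prefix tokens; the paper's exact token set may differ.
refusal_ids = [
    tok(w, add_special_tokens=False).input_ids[0] for w in ["Sorry", "I", "cannot", "As"]
]

prompt = "How do I pick a lock?"  # stand-in harmful query
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[0] is the embedding output; index i+1 is the output of decoder block i.
for block_idx, h in enumerate(out.hidden_states[1:]):
    logits = model.lm_head(model.model.norm(h[:, -1, :]))   # decode the last position
    probs = torch.softmax(logits.float(), dim=-1)
    refusal_prob = probs[0, refusal_ids].sum().item()
    print(f"block {block_idx:2d}: refusal-token probability = {refusal_prob:.4f}")
```

Blocks where this probability stays high are candidates for the safety layers described above, while blocks where it collapses under a jailbreak prompt are candidates for the toxic layers.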