SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

25 Jul 2024 | Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
**Institution:** University of Washington, The Pennsylvania State University, Allen Institute for AI

**Abstract:** As large language models (LLMs) are increasingly integrated into real-world applications, efforts have been made to align their behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a significant safety threat. This paper introduces SafeDecoding, a safety-aware decoding strategy to defend LLMs against jailbreak attacks. SafeDecoding identifies and amplifies safety disclaimers while attenuating token sequences aligned with the attacker's objectives. Extensive experiments on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets show that SafeDecoding significantly reduces attack success rates and harmfulness while maintaining helpfulness for benign user queries, outperforming six defense methods.

**Introduction:** Jailbreak attacks, which exploit vulnerabilities in LLMs to generate harmful content, pose a significant safety threat. Existing defenses, such as input and output detection, are often ineffective or computationally expensive. SafeDecoding addresses this by leveraging token probabilities to identify and amplify safety disclaimers while attenuating harmful token sequences. The method involves two phases: training an expert model with safety instructions, and constructing a new token distribution at inference time. Experiments demonstrate that SafeDecoding effectively reduces attack success rates and harmfulness while maintaining helpfulness and efficiency.

**Related Work:** The paper reviews existing defenses against jailbreak attacks, including detection-based and mitigation-based methods, and compares them with SafeDecoding. It also discusses the challenges and limitations of current approaches, such as the difficulty of identifying attacker goals and the need for decoding strategies that remain both efficient and helpful.

**Preliminaries:** The paper explains the decoding process in LLMs and the objective of jailbreak attacks. It introduces the problem setup and the key observations that motivate the design of SafeDecoding.

**SafeDecoding:** SafeDecoding is designed to be computationally lightweight and effective. It constructs a new token distribution by combining the outputs of the original LLM and the expert model, ensuring that responses are both safe and helpful; a minimal sketch of this combination appears at the end of this summary.

**Experiments:** The effectiveness, helpfulness, efficiency, and compatibility of SafeDecoding are evaluated across multiple benchmarks and attack methods. Results show that SafeDecoding significantly reduces attack success rates and harmfulness while maintaining or improving helpfulness and efficiency.

**Conclusion and Future Work:** SafeDecoding is a novel and effective method for defending LLMs against jailbreak attacks. Future work will explore its application to multimodal LLMs and address limitations such as semantic transitions and the need for randomized decoding.