SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
25 Jul 2024 | Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
SafeDecoding is a safety-aware decoding strategy that defends large language models (LLMs) against jailbreak attacks. The paper's key observation is that even when tokens representing harmful content carry the highest probabilities under a jailbreak prompt, tokens that begin safety disclaimers still appear among the top candidates when tokens are sorted in descending order of probability. SafeDecoding exploits this by identifying and amplifying the probabilities of safety disclaimers while attenuating the probabilities of token sequences aligned with the attacker's objective.

Concretely, the method fine-tunes an expert model on safety-oriented instructions and, at inference time, combines the expert's token distribution with the original model's to construct a new distribution that balances utility and safety.

SafeDecoding is evaluated on five LLMs against six jailbreak attacks and four benchmark datasets. It significantly reduces attack success rates and response harmfulness without degrading helpfulness on benign queries, and it outperforms six existing defense methods in effectiveness, efficiency, and compatibility. The method is computationally efficient and preserves the original LLM's helpfulness for benign users. The paper also reviews related work on jailbreak attacks and existing defenses and presents ablation studies on hyperparameters. Overall, the results show that SafeDecoding enhances LLM safety while maintaining utility and efficiency.
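To make the distribution-combination step concrete, below is a minimal sketch of one decoding step in the spirit of SafeDecoding: the new token probability is formed as P_base(x) + alpha * (P_expert(x) - P_base(x)), so tokens the safety-tuned expert favors (e.g., disclaimer openers) are amplified. This is not the authors' implementation; the function name, the alpha value, the candidate-set size, and the use of a union of top tokens (the paper instead intersects the two models' top-k lists) are all illustrative assumptions.

```python
import torch

def safe_decoding_step(logits_base, logits_expert, alpha=3.0, top_c=10):
    """Sketch of one SafeDecoding-style decoding step (illustrative, not official).

    logits_base:   next-token logits from the original LLM, shape (vocab,)
    logits_expert: next-token logits from the safety-tuned expert model, shape (vocab,)
    alpha:         weight amplifying the expert-vs-base probability difference
    top_c:         number of top tokens taken from each model for the candidate set
    """
    p_base = torch.softmax(logits_base, dim=-1)
    p_expert = torch.softmax(logits_expert, dim=-1)

    # Candidate set: union of each model's top tokens. (The paper constructs
    # the sample space by intersecting top-k lists; a union keeps this sketch
    # simple and guarantees a non-empty set.)
    top_base = torch.topk(p_base, top_c).indices
    top_expert = torch.topk(p_expert, top_c).indices
    candidates = torch.unique(torch.cat([top_base, top_expert]))

    # Combined distribution: P_new(x) = P_base(x) + alpha * (P_expert(x) - P_base(x)).
    # Tokens the expert ranks higher (safety disclaimers) gain mass; tokens
    # aligned with the jailbreak objective lose mass.
    combined = p_base[candidates] + alpha * (p_expert[candidates] - p_base[candidates])
    combined = torch.clamp(combined, min=0.0)   # avoid negative pseudo-probabilities
    combined = combined / combined.sum()        # renormalize over the candidate set

    # Sample the next token from the renormalized distribution.
    return candidates[torch.multinomial(combined, 1)]
```

With alpha = 0 this reduces to the base model's decoding, and larger alpha pushes generation toward the expert's safety behavior, which is the utility-safety trade-off the hyperparameter ablations in the paper explore.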