Refuse Whenever You Feel Unsafe: IMPROVING SAFETY IN LLMs VIA DECOUPLED REFUSAL TRAINING

12 Jul 2024 | Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
This paper addresses the critical issue of refusal position bias in safety tuning data for Large Language Models (LLMs), which hinders their ability to refuse to generate unsafe content effectively. The authors propose a novel approach called Decoupled Refusal Training (DeRTa), which comprises two key components: Maximum Likelihood Estimation (MLE) with Harmful Response Prefix and Reinforced Transition Optimization (RTO). MLE with Harmful Response Prefix trains models to recognize and move away from unsafe content by prepending a segment of a harmful response to the safe response, while RTO ensures that models can transition from potential harm to a safety refusal at any position in the harmful response sequence. Empirical evaluations using LLaMA3 and Mistral models across six attack scenarios demonstrate that DeRTa significantly improves model safety without compromising performance, outperforming models such as GPT-4 in defending against advanced attacks like CodeAttack and CompletingAttack. The method also successfully defends against jailbreak attacks that have bypassed GPT-4 and LLaMA3-70B-Instruct. The paper highlights the importance of addressing refusal position bias in safety tuning data and provides a robust strategy to enhance the safety capabilities of LLMs.
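To make the two objectives more concrete, the sketch below shows one way they could be combined for a single training example. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Hugging Face-style causal LM, pre-tokenized 1-D tensors, and hypothetical argument names such as `harmful_prefix_ids` and `refusal_start_id`.

```python
# Minimal sketch (not the paper's code) of DeRTa-style training objectives,
# assuming a Hugging Face causal LM and pre-tokenized inputs.
import torch
import torch.nn.functional as F


def derta_loss(model, query_ids, harmful_prefix_ids, refusal_ids,
               harmful_ids, refusal_start_id):
    """Combine the two DeRTa objectives for one example.

    1. MLE with harmful response prefix: condition on the query plus a
       truncated harmful response, and apply the LM loss only to the safe
       refusal that follows.
    2. Reinforced Transition Optimization (RTO): at every position of the
       full harmful response, train the model to predict the first token of
       the refusal (e.g. the token starting "Sorry"), so it can switch to
       refusing anywhere in the sequence.
    """
    # ---- Objective 1: MLE with harmful response prefix ----
    input_ids = torch.cat([query_ids, harmful_prefix_ids, refusal_ids]).unsqueeze(0)
    labels = input_ids.clone()
    # Mask the query and the harmful prefix (-100 is ignored by the HF loss),
    # so only the refusal tokens contribute to this objective.
    prefix_len = query_ids.numel() + harmful_prefix_ids.numel()
    labels[0, :prefix_len] = -100
    mle_loss = model(input_ids=input_ids, labels=labels).loss

    # ---- Objective 2: Reinforced Transition Optimization ----
    rto_input = torch.cat([query_ids, harmful_ids]).unsqueeze(0)
    logits = model(input_ids=rto_input).logits  # (1, seq_len, vocab)
    # Logits from the last query position onward: at each of these steps the
    # model has seen the query plus 0..H harmful tokens, and the target is
    # always the first refusal token, encouraging a transition to safety.
    harm_logits = logits[0, query_ids.numel() - 1 :, :]
    targets = torch.full((harm_logits.size(0),), refusal_start_id,
                         dtype=torch.long, device=harm_logits.device)
    rto_loss = F.cross_entropy(harm_logits, targets)

    return mle_loss + rto_loss
```

In this reading, the first term teaches the model what a safe continuation looks like after a partially harmful generation, while the second term supervises the refusal transition densely at every harmful-token position rather than only at the start of the response.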