This paper introduces Decoupled Refusal Training (DeRTa), a novel approach to improving the safety of Large Language Models (LLMs) by addressing a critical issue in safety tuning data: refusal position bias. Because safety tuning data places refusals at the very start of responses, models learn to make refusal decisions only at the beginning of a response and cannot effectively refuse to generate unsafe content once generation is underway. DeRTa introduces two key components to address this issue: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
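To make the two objectives concrete, the following is a minimal sketch, in PyTorch-style Python, of how the training examples and token-level losses might be constructed. The helper names, the use of a single canonical refusal token id, and the random-length prefix sampling are illustrative assumptions based on the paper's description, not its released implementation.

```python
import random
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss


def build_derta_example(query_ids, harmful_ids, safe_ids, refusal_id):
    """Build one training example for the two DeRTa-style objectives.

    Assumes token-id lists produced by some tokenizer, and that
    `refusal_id` is the first token of a canonical refusal (e.g. "Sorry").
    """
    # (1) MLE with harmful response prefix: condition the safe response on a
    #     random-length prefix of the harmful response.
    k = random.randint(0, len(harmful_ids))
    prefix = harmful_ids[:k]
    mle_input_ids = query_ids + prefix + safe_ids
    # Learn only the safe response; the query and harmful prefix are context.
    mle_labels = [IGNORE_INDEX] * (len(query_ids) + len(prefix)) + list(safe_ids)

    # (2) Reinforced Transition Optimization (RTO): at every position of the
    #     full harmful response, supervise a transition to the refusal token.
    rto_input_ids = query_ids + harmful_ids
    rto_labels = [IGNORE_INDEX] * len(query_ids) + [refusal_id] * len(harmful_ids)

    return mle_input_ids, mle_labels, rto_input_ids, rto_labels


def token_loss(logits, labels):
    """Next-token cross-entropy with the usual one-position shift.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len).
    """
    logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    labels = labels[:, 1:]       # targets shifted left by one
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

In this reading, the MLE term teaches the model to produce the safe response even after a partial harmful continuation, while the RTO term supervises a refusal transition at every position of the harmful response, which is what counteracts the position bias.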
The proposed method is evaluated on the LLaMA3 and Mistral model families across six attack scenarios, demonstrating that DeRTa not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. In particular, the approach successfully defends against recent advanced attack methods, including CodeAttack, which have jailbroken GPT-4 and LLaMA3-70B-Instruct.
The study highlights the importance of addressing refusal position bias in safety tuning data so that LLMs can effectively refuse to generate unsafe content at any point during the response. The results show that DeRTa significantly enhances the safety of LLMs by enabling them to recognize potential risks and halt the generation of unsafe content as soon as those risks are detected. The method is effective across different model architectures and sizes, demonstrating its generality and robustness. The findings also emphasize the need to consider the role of safety tuning data and the inherent biases that may affect an LLM's ability to make refusal decisions effectively.