10 Jun 2024 | Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
Current large language models (LLMs) suffer from significant safety alignment vulnerabilities: even simple attacks, or benign fine-tuning, can jailbreak aligned models. This paper argues that the root cause is "shallow safety alignment," in which safety alignment primarily adapts the model's generative distribution over only the first few output tokens. This shallowness leaves models vulnerable to adversarial suffix attacks, prefilling attacks, decoding-parameter attacks, and fine-tuning attacks. The paper presents case studies and experiments showing that current aligned LLMs are susceptible to these attacks precisely because their safety alignment is shallow. It then proposes countermeasures that deepen the alignment: a data augmentation approach and a constrained optimization objective that makes safety alignment more persistent against fine-tuning attacks. The paper advocates that future safety alignment be made more than just a few tokens deep, which would significantly improve robustness against these common exploits, and discusses the broader implications of shallow safety alignment for developing more robust alignment strategies.
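To make the "constrained optimization objective" concrete, here is a minimal sketch of one way such a constraint could look during fine-tuning: the usual next-token loss on the new task data plus a per-token KL penalty toward the frozen aligned model, with a heavier penalty weight on the first few response tokens, where shallow safety alignment concentrates. This is an illustrative assumption, not the paper's exact formulation; the model name, the weighting schedule (beta_early, beta_late, early_window), and the helper constrained_finetune_loss are placeholders.

```python
# Illustrative sketch (assumed, not the paper's exact objective): fine-tune while
# constraining per-token output distributions to stay close to the initial
# aligned model, more strongly on the earliest response tokens.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed aligned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)       # trainable copy
ref_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)   # frozen reference
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)


def constrained_finetune_loss(input_ids, labels, response_start,
                              beta_early=2.0, beta_late=0.1, early_window=5):
    """Next-token loss on the fine-tuning data plus a per-token KL penalty
    toward the frozen aligned model. Early response tokens get a larger
    penalty weight, so fine-tuning cannot easily overwrite the initial
    refusal behavior. `response_start` is the index of the first response
    token in the (shifted) label sequence; prompt tokens carry label -100."""
    logits = model(input_ids).logits[:, :-1]           # predict token t+1 from prefix
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits[:, :-1]

    shift_labels = labels[:, 1:]

    # Standard fine-tuning objective (per-token, prompt positions ignored).
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    ).view(shift_labels.shape)

    # Per-token KL(model || reference) over the vocabulary.
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum(-1)       # [batch, seq-1]

    # Heavier constraint weight on the first few response tokens.
    positions = torch.arange(shift_labels.size(1), device=kl.device)
    in_early = (positions >= response_start) & (positions < response_start + early_window)
    beta = torch.where(
        in_early,
        torch.tensor(beta_early, device=kl.device),
        torch.tensor(beta_late, device=kl.device),
    )

    mask = (shift_labels != -100).float()
    return ((ce + beta * kl) * mask).sum() / mask.sum()
```

The design choice the sketch illustrates is the asymmetry: deviations from the aligned model are tolerated late in the response (so the model can still learn the downstream task) but penalized heavily over the opening tokens, which is where a shallowly aligned model's safety behavior lives.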