Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

6 Jun 2024 | Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Emulated disalignment (ED) is a training-free attack that reverses safety alignment in large language models (LLMs), increasing their potential for harm. ED contrasts the output token distributions of a safety-aligned model (e.g., Llama-2-chat) with those of its pre-trained counterpart (e.g., Llama-2) and shifts token predictions in the direction opposite to safety alignment. This emulates the result of fine-tuning the pre-trained model to minimize a safety reward, producing harmful outputs without any additional training.

Experiments across three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 of 48 evaluation subsets. ED is also competitive with direct disalignment via training, suggesting that stronger alignment can translate into greater potential for harm when inverted.

Because ED requires only access to output token distributions, it particularly compromises open-source models and can even be applied across model families that share a vocabulary. This challenges the view that openly releasing LLMs, when done safely, is a net benefit to society: both pre-trained and safety-aligned models can be exploited for malicious purposes. The findings underscore the need for more robust alignment algorithms and a careful reassessment of the open accessibility of language models, even when they have been safety-aligned.
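The contrastive decoding idea summarized above can be sketched as a token-level combination of the two distributions. The notation below (pi_base for the pre-trained model, pi_align for the safety-aligned model, and a strength parameter alpha > 0) is an illustrative reading of the abstract's description via the emulated fine-tuning view, not a verbatim reproduction of the paper's equations:

\[
  \pi_{\mathrm{ED}}(y_t \mid x, y_{<t}) \;\propto\;
  \pi_{\mathrm{base}}(y_t \mid x, y_{<t})
  \left( \frac{\pi_{\mathrm{base}}(y_t \mid x, y_{<t})}{\pi_{\mathrm{align}}(y_t \mid x, y_{<t})} \right)^{\alpha},
  \qquad \alpha > 0,
\]
or equivalently, in log space,
\[
  \log \pi_{\mathrm{ED}} \;=\; (1+\alpha)\,\log \pi_{\mathrm{base}} \;-\; \alpha\,\log \pi_{\mathrm{align}} \;+\; \mathrm{const}.
\]

Under this reading, the ratio \( \pi_{\mathrm{align}} / \pi_{\mathrm{base}} \) acts as the implicit safety reward induced by alignment fine-tuning, and raising it to the power \( -\alpha \) emulates fine-tuning the base model to minimize that reward, which is why only output token distributions are needed.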