Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!


6 Jun 2024 | Zhanhui Zhou†, Jie Liu*, Zhichen Dong*, Jiaheng Liu, Chao Yang†, Wanli Ouyang, Yu Qiao
This paper introduces *emulated disalignment* (ED), a training-free attack that reverses safety alignment in large language models (LLMs). Safety alignment is the fine-tuning process intended to make LLMs helpful while remaining safe, yet ED shows that this very alignment can be exploited to produce harmful content without any additional training. The method contrasts the output token distribution of a safety-aligned LLM with that of its pre-trained counterpart, shifting predictions in the direction opposite to safety alignment. Experiments on three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rates in 43 of 48 evaluation subsets. Because ED requires only access to output token distributions, the findings call for reassessing the open accessibility of language models, even those that have been safety-aligned.
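To make the contrastive mechanism concrete, the sketch below combines per-token log-probabilities from a pre-trained model and its safety-aligned counterpart and extrapolates away from the alignment direction. It assumes a combination of the form (1 + α)·log π_base − α·log π_aligned; the function name `emulated_disalignment_logits`, the parameter `alpha`, and the random placeholder logits are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def emulated_disalignment_logits(base_logits: torch.Tensor,
                                 aligned_logits: torch.Tensor,
                                 alpha: float = 1.0) -> torch.Tensor:
    """Contrast the safety-aligned distribution against the pre-trained one.

    Assumed combination (a sketch, not the paper's exact formula):
        log p_ED ∝ (1 + alpha) * log p_base - alpha * log p_aligned
    which pushes sampling in the direction opposite to safety alignment.
    """
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    aligned_logprobs = F.log_softmax(aligned_logits, dim=-1)
    # Extrapolate away from the aligned model's token distribution.
    combined = (1.0 + alpha) * base_logprobs - alpha * aligned_logprobs
    return combined  # unnormalized log-probabilities; softmax before sampling


if __name__ == "__main__":
    vocab_size = 32000
    # Stand-in logits for a single decoding step; in practice these would come
    # from forward passes of the pre-trained and safety-aligned checkpoints
    # on the same prompt prefix.
    base_logits = torch.randn(1, vocab_size)
    aligned_logits = torch.randn(1, vocab_size)

    ed_logits = emulated_disalignment_logits(base_logits, aligned_logits, alpha=1.0)
    next_token = torch.multinomial(F.softmax(ed_logits, dim=-1), num_samples=1)
    print(next_token.item())
```

The key point the sketch illustrates is that the attack operates purely at decoding time on output token distributions, which is why open access to those distributions is the security-relevant interface.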