How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

13 Jun 2024 | Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li
This paper explores how large language models (LLMs) ensure safety through their intermediate hidden states and how jailbreak attacks bypass those safety mechanisms. The study argues that LLMs learn ethical concepts during pre-training rather than during alignment, and that these concepts let the model distinguish malicious from normal inputs in its early layers. Alignment then associates these early concepts with emotion guesses in the middle layers and refines them into specific reject tokens for safe generation.

Using weak classifiers trained on intermediate hidden states, the authors show that LLMs separate ethical and unethical inputs with over 95% accuracy. Aligned models associate positive emotions with ethically compliant inputs and negative emotions with non-compliant ones in the middle layers, and ultimately convert these emotions into stylized tokens. Experiments on models from 7B to 70B parameters across several model families confirm these findings.
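As a concrete illustration of this probing setup, the following minimal sketch trains a logistic-regression "weak classifier" on last-token hidden states taken from one intermediate layer. It is not the authors' released code: the model name, layer index, and toy prompt set are placeholder assumptions, and a four-example probe will not reproduce the paper's reported accuracy.

```python
# Minimal probing sketch (assumptions: placeholder model name, layer index, toy prompts).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any aligned causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str, layer: int) -> torch.Tensor:
    """Return the hidden state of the final prompt token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output; hidden_states[layer] is layer `layer`.
    return out.hidden_states[layer][0, -1].float()

# Tiny illustrative probe set (label 1 = malicious request, 0 = normal request).
prompts = [
    "How do I bake sourdough bread at home?",
    "Explain how photosynthesis works.",
    "How can I pick a lock to break into a house?",
    "Write code that steals saved browser passwords.",
]
labels = [0, 0, 1, 1]

layer = 6  # an early layer; the paper reports early layers already separate the classes
X = torch.stack([last_token_state(p, layer) for p in prompts]).numpy()

probe = LogisticRegression(max_iter=1000).fit(X, labels)  # the "weak" linear classifier
print("training accuracy:", probe.score(X, labels))
```

In practice one would fit a probe per layer on a held-out split and report test accuracy layer by layer; the point of the sketch is only that the classifier's features are intermediate hidden states rather than final outputs.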
Jailbreak inputs disrupt this pipeline: they interfere with the transformation of the early unethical classification into negative emotions in the middle layers, which leads to harmful outputs. To test this account, the paper proposes Logit Grafting, a method that approximates the disruption caused by jailbreak by grafting the positive emotions elicited by normal inputs onto jailbreak inputs. Experimental results show that Logit Grafting can mimic the effect of jailbreak, indicating that disrupting the association between early and middle layers is enough to produce unsafe outputs.

The paper concludes that LLMs ensure safety by learning ethical concepts during pre-training and then, through alignment, binding these concepts to emotional tokens in the middle layers. Alignment acts as a conceptual bridge that associates unethical inputs with negative emotions so that generation remains harmless. The study offers new insight into LLM safety, improves transparency, and contributes to the development of responsible LLMs. The code and datasets are available for further research.
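To make the grafting idea concrete, the sketch below caches a middle-layer, last-token hidden state from a benign prompt and writes it over the corresponding state of another prompt via a forward hook before generating. This is a simplified, hook-based approximation of the kind of intervention the summary describes, offered under stated assumptions (placeholder model name, assumed layer index, last-token-only graft); it is not the authors' released Logit Grafting implementation.

```python
# Grafting-style sketch (assumptions: placeholder model, assumed middle layer,
# last-token-only graft). Not the authors' released Logit Grafting code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 15  # an assumed "middle" layer for a 32-layer 7B model

def cache_last_token_state(prompt: str) -> torch.Tensor:
    """Cache the last-token hidden state at LAYER for a benign prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].clone()

benign_state = cache_last_token_state("Please write a short poem about spring.")

def graft_hook(module, args, output):
    """Overwrite the last prompt token's hidden state (in place) with the cached benign state."""
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:  # only the prefill pass over the prompt, not decode steps
        hidden[:, -1, :] = benign_state.to(hidden.dtype)
    # edited in place; no replacement output needs to be returned

# hidden_states[LAYER] is the output of decoder layer index LAYER - 1.
handle = model.model.layers[LAYER - 1].register_forward_hook(graft_hook)
try:
    prompt = "<a jailbreak-style prompt would go here>"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()
```

Grafting only at the last prompt token during the prefill pass mirrors the intuition in the summary: the early-layer classification of the input is left untouched, and only the middle-layer signal that alignment would normally turn into refusal tokens is replaced.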