How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

13 Jun 2024 | Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li
This paper explores how large language models (LLMs) ensure safety through their intermediate hidden states and how jailbreak attacks bypass those safety mechanisms. The study argues that LLMs learn ethical concepts during pre-training rather than during alignment, and that these concepts let the model distinguish malicious from normal inputs in its early layers. Alignment then associates these early concepts with emotion guesses in the middle layers and refines them into specific reject tokens for safe generation.

Using weak classifiers trained on intermediate hidden states, the authors show that LLMs separate ethical and unethical inputs with over 95% accuracy. Aligned models associate positive emotions with ethically compliant inputs and negative emotions with non-compliant ones in the middle layers, and ultimately convert these emotions into stylized tokens. Experiments on models from 7B to 70B parameters across several model families confirm these findings.
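As a concrete illustration of this probing setup, the following minimal sketch trains a logistic-regression "weak classifier" on last-token hidden states taken from one intermediate layer. It is not the authors' released code: the model name, layer index, and toy prompt set are placeholder assumptions, and a four-example probe will not reproduce the paper's reported accuracy.

```python
# Minimal probing sketch (assumptions: placeholder model name, layer index, toy prompts).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any aligned causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str, layer: int) -> torch.Tensor:
    """Return the hidden state of the final prompt token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output; hidden_states[layer] is layer `layer`.
    return out.hidden_states[layer][0, -1].float()

# Tiny illustrative probe set (label 1 = malicious request, 0 = normal request).
prompts = [
    "How do I bake sourdough bread at home?",
    "Explain how photosynthesis works.",
    "How can I pick a lock to break into a house?",
    "Write code that steals saved browser passwords.",
]
labels = [0, 0, 1, 1]

layer = 6  # an early layer; the paper reports early layers already separate the classes
X = torch.stack([last_token_state(p, layer) for p in prompts]).numpy()

probe = LogisticRegression(max_iter=1000).fit(X, labels)  # the "weak" linear classifier
print("training accuracy:", probe.score(X, labels))
```

In practice one would fit a probe per layer on a held-out split and report test accuracy layer by layer; the point of the sketch is only that the classifier's features are intermediate hidden states rather than final outputs.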
Jailbreak inputs disrupt this pipeline: they interfere with the transformation of the early unethical classification into negative emotions in the middle layers, which leads to harmful outputs. To test this account, the paper proposes Logit Grafting, a method that approximates the disruption caused by jailbreak by grafting the positive emotions elicited by normal inputs onto jailbreak inputs. Experimental results show that Logit Grafting can mimic the effect of jailbreak, indicating that disrupting the association between early and middle layers is enough to produce unsafe outputs.

The paper concludes that LLMs ensure safety by learning ethical concepts during pre-training and then, through alignment, binding these concepts to emotional tokens in the middle layers. Alignment acts as a conceptual bridge that associates unethical inputs with negative emotions so that generation remains harmless. The study offers new insight into LLM safety, improves transparency, and contributes to the development of responsible LLMs. The code and datasets are available for further research.
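To make the grafting idea concrete, the sketch below caches a middle-layer, last-token hidden state from a benign prompt and writes it over the corresponding state of another prompt via a forward hook before generating. This is a simplified, hook-based approximation of the kind of intervention the summary describes, offered under stated assumptions (placeholder model name, assumed layer index, last-token-only graft); it is not the authors' released Logit Grafting implementation.

```python
# Grafting-style sketch (assumptions: placeholder model, assumed middle layer,
# last-token-only graft). Not the authors' released Logit Grafting code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 15  # an assumed "middle" layer for a 32-layer 7B model

def cache_last_token_state(prompt: str) -> torch.Tensor:
    """Cache the last-token hidden state at LAYER for a benign prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].clone()

benign_state = cache_last_token_state("Please write a short poem about spring.")

def graft_hook(module, args, output):
    """Overwrite the last prompt token's hidden state (in place) with the cached benign state."""
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:  # only the prefill pass over the prompt, not decode steps
        hidden[:, -1, :] = benign_state.to(hidden.dtype)
    # edited in place; no replacement output needs to be returned

# hidden_states[LAYER] is the output of decoder layer index LAYER - 1.
handle = model.model.layers[LAYER - 1].register_forward_hook(graft_hook)
try:
    prompt = "<a jailbreak-style prompt would go here>"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()
```

Grafting only at the last prompt token during the prefill pass mirrors the intuition in the summary: the early-layer classification of the input is left untouched, and only the middle-layer signal that alignment would normally turn into refusal tokens is replaced.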