Refusal in Language Models Is Mediated by a Single Direction


15 Jul 2024 | Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
This paper investigates the mechanism behind the refusal behavior of large language models (LLMs) in response to harmful versus benign instructions. The authors find that this behavior is mediated by a single one-dimensional subspace across 13 popular open-source chat models with up to 72 billion parameters. They identify a specific direction in the model's residual stream activations such that erasing this direction prevents the model from refusing harmful instructions, while adding it elicits refusal even on harmless instructions. Leveraging this insight, they propose a novel white-box jailbreak method that disables refusal with minimal impact on other model capabilities. The study also analyzes how adversarial suffixes suppress the propagation of the refusal-mediating direction, highlighting the brittleness of current safety fine-tuning methods. The findings underscore the need for more robust safety mechanisms and provide practical insights into controlling model behavior.
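To make the two interventions described above concrete, here is a minimal sketch (not the authors' released code) of how a refusal direction can be estimated as a difference of mean activations and then either projected out of, or added to, residual-stream activations. The function names, tensor shapes, and the choice of layer/token position are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of directional ablation / addition on residual-stream activations.
# Shapes and names are assumptions; real use requires hooking a specific model layer.
import torch


def difference_of_means_direction(harmful_acts: torch.Tensor,
                                   harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a refusal direction as the difference between mean activations on
    harmful vs. harmless instructions, normalized to unit length.

    harmful_acts, harmless_acts: (num_examples, d_model) activations collected at
    a chosen layer and token position (an assumed setup for this sketch).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the unit-norm direction from each activation vector by projecting it out,
    so the model can no longer represent that direction."""
    coeffs = activations @ direction                    # projection coefficients
    return activations - coeffs.unsqueeze(-1) * direction


def add_direction(activations: torch.Tensor, direction: torch.Tensor,
                  scale: float = 1.0) -> torch.Tensor:
    """Add the direction (scaled) to activations to induce refusal on harmless prompts."""
    return activations + scale * direction


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model = 8
    harmful = torch.randn(16, d_model) + 1.0            # toy stand-ins for real activations
    harmless = torch.randn(16, d_model)
    r_hat = difference_of_means_direction(harmful, harmless)

    acts = torch.randn(4, d_model)
    ablated = ablate_direction(acts, r_hat)
    # After ablation, the component along r_hat is (numerically) zero.
    print((ablated @ r_hat).abs().max())
```

In practice the same projection would be applied at every layer and token position during generation, which is what makes the intervention a weight-free "jailbreak"; the toy tensors above only demonstrate the linear algebra.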