15 Jul 2024 | Andy Arditi, Oscar Obeso, Aaqib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
This paper investigates the mechanism behind refusal behavior in large language models (LLMs), showing that refusal is mediated by a single direction in the model's residual stream activations. Across 13 popular open-source chat models with up to 72B parameters, the researchers identify a single difference-in-means direction that can be manipulated to bypass refusal on harmful instructions or to induce refusal on harmless ones. Building on this direction, they develop a novel white-box jailbreak method that effectively disables refusal with minimal impact on other model capabilities; even a simple rank-one weight modification suffices to nearly eliminate refusal behavior. The study also demonstrates that adversarial suffixes suppress the propagation of the refusal-mediating direction, highlighting the vulnerability of current safety fine-tuning methods. The findings offer a concrete example of how insights from model internals can be practically useful for controlling model behavior, improving safety, and understanding vulnerabilities, and they add to the growing body of literature on the fragility of current safety mechanisms and the potential risks of open-source model misuse.
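To make the core technique concrete, here is a minimal sketch of the two operations the summary describes: computing a difference-in-means "refusal direction" from cached residual-stream activations, and removing that direction either from activations at inference time or via a rank-one edit of a weight matrix that writes to the residual stream. The tensor shapes, function names, and toy data below are illustrative assumptions, not the authors' released code; in practice the activations would be collected from a chat model at a chosen layer and token position.

```python
# Sketch of the difference-in-means refusal direction and its ablation.
# Assumes residual-stream activations have already been cached for harmful
# and harmless prompts; all names and shapes here are hypothetical.

import torch


def difference_in_means_direction(
    harmful_acts: torch.Tensor,   # (n_harmful, d_model) activations on harmful prompts
    harmless_acts: torch.Tensor,  # (n_harmless, d_model) activations on harmless prompts
) -> torch.Tensor:
    """Unit-norm direction: mean(harmful activations) - mean(harmless activations)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def ablate_from_activations(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along the refusal direction."""
    return acts - (acts @ direction).unsqueeze(-1) * direction


def ablate_from_weight(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Rank-one weight edit: project the refusal direction out of a matrix whose
    output lands in the residual stream (shape (d_model, d_in), HF Linear convention)."""
    return weight - torch.outer(direction, direction) @ weight


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model = 16
    harmful = torch.randn(8, d_model) + 2.0   # toy stand-ins for cached activations
    harmless = torch.randn(8, d_model)

    r_hat = difference_in_means_direction(harmful, harmless)
    edited_acts = ablate_from_activations(harmful, r_hat)
    # After ablation the activations have (near-)zero component along the direction.
    print(float((edited_acts @ r_hat).abs().max()))
```

The weight-space version mirrors the activation-space one: since (I - r̂r̂ᵀ)Wx removes the r̂ component from every output of W, applying the projection once to W yields a permanent rank-one modification, which is why a single such edit can disable refusal across all inputs.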