[slides] Finding Safety Neurons in Large Language Models

This paper explores the inner mechanisms of safety alignment in large language models (LLMs) through mechanistic interpretability, focusing on identifying and analyzing the safety neurons responsible for safety behaviors. The authors propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that safety neurons are sparse and effective: intervening on about 5% of all neurons restores 90% of safety performance. Safety neurons also encode transferable mechanisms, remaining consistently effective across different red-teaming datasets. The findings offer an interpretation of the "alignment tax," in which safety alignment improves model safety at the cost of helpfulness: the key neurons for safety and helpfulness overlap significantly but require different activation patterns. The study also demonstrates an application of safety neurons to detecting unsafe outputs before generation, improving model safety by refusing to respond when harmful content is detected. The findings may promote further research on understanding LLM alignment. The source code will be publicly released to facilitate future research.
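The two techniques named above can be illustrated with a toy sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: neurons are scored by the root-mean-square difference between the activations of an aligned model and an unaligned model over the same generated tokens, the top 5% are kept, and patching copies the aligned model's values into the unaligned model's activations at those positions for a single step. The function names (`contrast_scores`, `top_fraction`, `patch`) and the synthetic data are hypothetical and are not taken from the authors' code.

```python
import random

def contrast_scores(acts_a, acts_b):
    """Per-neuron RMS difference between two activation matrices.

    Each argument is a list of rows (one per generated token), and each
    row is a list of neuron activations. The RMS score is an assumed
    choice for this sketch, not a formula stated in the abstract.
    """
    n_tokens = len(acts_a)
    n_neurons = len(acts_a[0])
    scores = []
    for j in range(n_neurons):
        sq = sum((acts_a[t][j] - acts_b[t][j]) ** 2 for t in range(n_tokens))
        scores.append((sq / n_tokens) ** 0.5)
    return scores

def top_fraction(scores, fraction):
    """Indices of the highest-scoring neurons, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def patch(base_row, aligned_row, neuron_ids):
    """Copy the aligned model's activations into the base row at neuron_ids."""
    patched = list(base_row)
    for j in neuron_ids:
        patched[j] = aligned_row[j]
    return patched

# Synthetic activations: 8 generated tokens, 20 neurons.
random.seed(0)
base = [[random.gauss(0.0, 1.0) for _ in range(20)] for _ in range(8)]

# Toy "aligned" model: identical to the base model except on neuron 7.
aligned = [list(row) for row in base]
for row in aligned:
    row[7] += 2.0

scores = contrast_scores(aligned, base)
safety_ids = top_fraction(scores, 0.05)      # 5% of 20 neurons -> 1 neuron
patched = patch(base[0], aligned[0], safety_ids)
print(safety_ids)
```

In a real model the activation matrices would be recorded from the network during generation, and a dynamic variant would repeat the patch at every generation step rather than once; the sketch covers only the scoring, selection, and single-step replacement.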

Finding Safety Neurons in Large Language Models

20 Jun 2024 | Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li