20 Jun 2024 | Jianhui Chen*, Xiaozhi Wang*, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
This paper explores the mechanisms behind safety alignment in large language models (LLMs) by identifying and analyzing *safety neurons*. Safety neurons are the neurons responsible for safety behaviors: they account for about 90% of safety performance while making up only about 5% of all neurons. The authors propose two methods: *generation-time activation contrasting* to locate these neurons and *dynamic activation patching* to evaluate their causal effects. Experiments on multiple LLMs show that safety neurons are sparse, effective, and transferable across different datasets. The experiments also reveal that safety and helpfulness rely on many of the same key neurons but require different activation patterns from them. This finding offers insight into the "alignment tax" phenomenon, where enhancing safety comes at the cost of helpfulness. Additionally, the authors demonstrate an application of safety neurons: detecting unsafe outputs before generation, which improves model safety.
The source code will be publicly released to facilitate future research.
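
To make the idea of contrasting activations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that activations of the same neurons have already been recorded from a safety-aligned model and from its unaligned counterpart while generating on the same prompts, and that a per-neuron root-mean-square difference is a reasonable change score. The function names `change_scores` and `top_fraction`, the scoring rule, and the toy numbers are all hypothetical.

```python
import math


def change_scores(aligned_acts, base_acts):
    """Per-neuron RMS difference between two activation matrices.

    Each argument is a list of rows (one per generated token); each row is a
    list of activations (one per neuron). Both must have the same shape.
    The RMS scoring rule is an assumption made for this illustration.
    """
    n_tokens = len(aligned_acts)
    n_neurons = len(aligned_acts[0])
    scores = []
    for j in range(n_neurons):
        squared = sum(
            (aligned_acts[t][j] - base_acts[t][j]) ** 2 for t in range(n_tokens)
        )
        scores.append(math.sqrt(squared / n_tokens))
    return scores


def top_fraction(scores, fraction=0.05):
    """Return sorted indices of the highest-scoring neurons (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: 2 generated tokens, 4 neurons. Only neuron 1 differs strongly.
aligned = [[0.1, 2.0, 0.3, 0.0], [0.2, -1.0, 0.3, 0.1]]
base = [[0.1, 0.0, 0.2, 0.0], [0.2, 1.0, 0.4, 0.1]]

scores = change_scores(aligned, base)
candidates = top_fraction(scores, fraction=0.25)
print(candidates)
```

In this toy setting, the neuron whose activations diverge most between the two models is the only candidate selected. A causal check such as the activation patching described above would still be needed before treating any candidate as a safety neuron.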