24 Oct 2024 | Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang†, Peter Henderson†
This study explores the brittleness of safety alignment in large language models (LLMs) through pruning and low-rank modifications. The authors develop methods to identify the regions that are critical for safety guardrails and to disentangle them from utility-relevant regions at both the neuron and rank levels. Surprisingly, these safety-critical regions are sparse, comprising roughly 3% of parameters and 2.5% of ranks. Removing them significantly compromises safety while only mildly degrading utility, highlighting the inherent brittleness of LLMs' safety mechanisms. The study further shows that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to safety-critical regions are restricted. These findings underscore the need for more robust safety strategies. As initial directions, the authors propose pruning the regions least important for safety and making safety-critical regions harder to isolate. They also observe that MLP layers may encode more differentiated behaviors than attention layers, pointing to directions for future work.
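To make the neuron-level disentanglement concrete, here is a minimal sketch of one way to isolate "safety-only" weights: score each parameter's first-order importance (SNIP-style, |weight × gradient|) on a safety dataset and on a utility dataset, then take the set difference of the two top-k masks. The function names (`importance_scores`, `safety_only_mask`), the 3% default threshold, and the exact scoring rule are assumptions for illustration, not the authors' precise implementation.

```python
import torch
import torch.nn as nn

def importance_scores(model: nn.Module, loss: torch.Tensor) -> dict:
    """SNIP-style first-order importance: |weight * gradient| per parameter.
    Assumes `loss` was computed from a fresh forward pass on the dataset of interest."""
    model.zero_grad()
    loss.backward()
    return {
        name: (param.detach() * param.grad.detach()).abs()
        for name, param in model.named_parameters()
        if param.grad is not None
    }

def top_k_mask(scores: torch.Tensor, frac: float) -> torch.Tensor:
    """Boolean mask marking the top `frac` fraction of entries by score."""
    k = max(1, int(frac * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return scores >= threshold

def safety_only_mask(safety_scores: dict, utility_scores: dict, frac: float = 0.03) -> dict:
    """Weights important for safety but not for utility (set difference of top-k masks)."""
    masks = {}
    for name, s_score in safety_scores.items():
        safety_top = top_k_mask(s_score, frac)
        utility_top = top_k_mask(utility_scores[name], frac)
        masks[name] = safety_top & ~utility_top
    return masks

# Usage sketch (hypothetical losses computed on a safety set and a utility set):
#   safety_scores  = importance_scores(model, safety_loss)
#   utility_scores = importance_scores(model, utility_loss)
#   masks = safety_only_mask(safety_scores, utility_scores)
#   with torch.no_grad():
#       for name, param in model.named_parameters():
#           if name in masks:
#               param[masks[name]] = 0.0   # ablate isolated safety-critical weights
```

The rank-level analysis in the paper follows an analogous logic, removing a small number of ranks that matter for safety but not utility rather than individual neurons; the sketch above only illustrates the neuron-level set-difference idea.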