Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

2024 | Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
This study investigates the brittleness of safety alignment in large language models (LLMs) by locating the regions of model weights responsible for safety guardrails. Using pruning and low-rank modifications, the researchers isolate safety-critical neurons and ranks: components essential for refusing harmful requests yet largely dispensable for general utility. These safety-critical regions turn out to be sparse, accounting for roughly 3% of the weights and 2.5% of the ranks. Removing them compromises safety while only mildly affecting utility, exposing the inherent fragility of LLM safety mechanisms.

Moreover, even when modifications to the safety-critical regions are restricted, LLMs remain vulnerable to low-cost fine-tuning attacks: freezing the safety-critical neurons does not prevent such attacks, because fine-tuning can create new pathways that bypass the existing safety mechanisms. The results suggest that the sparsity of safety-critical regions may explain the brittleness of safety alignment, and that these regions could serve as an intrinsic metric for assessing alignment robustness. By identifying and isolating safety-critical components, the work points toward more reliable safety alignment methods and underscores the need for further research into safety mechanisms that withstand adversarial fine-tuning.
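To make the approach concrete, here is a minimal sketch, not the authors' released code: it assumes a Wanda-style importance score (|weight| × input-activation norm) as the attribution method, uses plain SVD in place of the paper's activation-aware rank analysis, and all function names and the 3% threshold default are illustrative, chosen only to mirror the sparsity figure reported in the summary.

```python
import torch

def wanda_importance(weight: torch.Tensor, acts: torch.Tensor) -> torch.Tensor:
    """Wanda-style score: |W_ij| * ||X_j||_2 over calibration activations.

    weight: (out_features, in_features) linear-layer weight
    acts:   (num_tokens, in_features) inputs to this layer on calibration data
    """
    col_norms = acts.norm(p=2, dim=0)            # ||X_j||_2 per input feature
    return weight.abs() * col_norms.unsqueeze(0)

def top_p_mask(scores: torch.Tensor, p: float) -> torch.Tensor:
    """Boolean mask selecting (approximately) the top-p fraction of entries."""
    k = max(1, int(p * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return scores >= threshold

def safety_critical_mask(weight, safety_acts, utility_acts, p=0.03):
    """Weights important for safety behavior but NOT for general utility:
    the set difference of the two top-p sets, mirroring the paper's idea of
    isolating a sparse (~3% of weights) safety-critical region."""
    safety_mask = top_p_mask(wanda_importance(weight, safety_acts), p)
    utility_mask = top_p_mask(wanda_importance(weight, utility_acts), p)
    return safety_mask & ~utility_mask

def remove_top_ranks(weight: torch.Tensor, r: int) -> torch.Tensor:
    """Low-rank analogue: zero out the top-r singular directions of W.
    (The paper uses an activation-aware variant; plain SVD shown for brevity.)"""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    S[:r] = 0.0
    return U @ torch.diag(S) @ Vh

if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(64, 64)
    safety_acts = torch.randn(256, 64)   # stand-in for safety calibration inputs
    utility_acts = torch.randn(256, 64)  # stand-in for utility calibration inputs
    mask = safety_critical_mask(W, safety_acts, utility_acts)
    W_pruned = W * ~mask                 # prune (zero) the safety-critical weights
    print(f"pruned {mask.float().mean().item():.2%} of weights")
```

Per the paper's findings, pruning the returned mask (or removing the identified safety ranks) would be expected to degrade refusal behavior on harmful prompts far more than it degrades general utility.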