Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning

29 Apr 2024 | Adib Hasan, Ileana Rugina, Alex Wang
This paper presents a method to increase the jailbreak resistance of large language models (LLMs) without fine-tuning, using moderate WANDA pruning. The study demonstrates that pruning the attention layers of aligned LLMs, specifically LLaMA-2 Chat, Vicuna 1.3, and Mistral Instruct v0.2, enhances their resistance to jailbreaking attacks. Pruning 10-20% of attention-layer weights increases the models' ability to refuse harmful prompts, while pruning beyond 20% can degrade safety. The safety improvements are attributed to the regularizing effect of pruning, which helps models focus more effectively on task-relevant tokens in jailbreaking prompts.

To evaluate safety systematically, the study introduces a dataset of 225 harmful tasks spanning five categories. The findings indicate that models with higher initial safety levels benefit more from pruning, and that WANDA pruning yields statistically significant performance improvements under domain shift.

The pruned models maintain performance on standard benchmarks, suggesting that the improved safety stems from regularization rather than from reduced language understanding. Further analysis of perplexity and attention patterns shows that pruned models assign higher perplexity to malicious prompts and attend more to task-relevant tokens. Overall, the results highlight moderate WANDA pruning as a way to enhance LLM safety without additional training or computational cost.
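For readers unfamiliar with WANDA (Pruning by Weights and Activations), the sketch below illustrates its scoring rule: a weight's importance is its magnitude multiplied by the L2 norm of the corresponding input activation, and the lowest-scoring weights in each comparison group are zeroed. This is a minimal, self-contained sketch of the general technique, not the paper's exact implementation; the function name, the per-output-row comparison group, and the calibration details are illustrative assumptions.

```python
import torch

def wanda_prune_linear(weight: torch.Tensor,
                       act_norms: torch.Tensor,
                       sparsity: float = 0.1) -> torch.Tensor:
    """Apply WANDA-style pruning to one linear layer's weight matrix.

    weight:    (out_features, in_features) weight matrix.
    act_norms: (in_features,) L2 norm of each input feature, recorded
               from a forward pass over a small calibration set.
    sparsity:  fraction of weights to zero in each output row
               (the summary reports ~0.1-0.2 as the safety-improving range).
    """
    # WANDA importance score: S_ij = |W_ij| * ||X_j||_2
    scores = weight.abs() * act_norms.unsqueeze(0)

    # Within each output row, locate the lowest-scoring weights.
    n_prune = int(weight.shape[1] * sparsity)
    _, prune_idx = torch.topk(scores, n_prune, dim=1, largest=False)

    # Build a keep-mask and zero out the pruned positions.
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask
```

In practice, act_norms would be collected by hooking the attention projections (query/key/value/output) during a forward pass over calibration text, and the returned matrix would replace the layer's weight in place; per the summary, only the attention layers are pruned.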
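The perplexity analysis mentioned above can be reproduced in outline with standard Hugging Face transformers calls. This is a hedged sketch assuming the usual causal-LM perplexity computation, not the authors' evaluation script; comparing a pruned and an unpruned checkpoint on the same jailbreak prompt would surface the effect described.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def prompt_perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of a prompt under a causal LM: the exponential of
    the mean next-token cross-entropy over the prompt's tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    # Passing labels=input_ids makes the model return the shifted
    # next-token cross-entropy loss, averaged over the sequence.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```

A higher value from the pruned model on a malicious prompt, with comparable values on benign text, is the pattern the summary describes as "penalizing" malicious prompts.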