29 Apr 2024 | Adib Hasan, Ileana Rugina, Alex Wang
This paper explores how moderate WANDA pruning can enhance the resistance of large language models (LLMs) to jailbreaking attacks without requiring fine-tuning. The study demonstrates that pruning the attention layers of LLMs, particularly with WANDA pruning, increases their resistance to harmful prompts while maintaining performance on standard benchmarks. The research introduces a dataset of 225 harmful tasks across five categories to evaluate the safety improvements from pruning. The findings suggest that the benefits of pruning correlate with the initial safety levels of the model, indicating a regularizing effect of WANDA pruning. The study also shows that pruning helps LLMs focus more effectively on task-relevant tokens within jailbreaking prompts and increases the refusal rate of malicious tasks. The results indicate that moderate pruning (10-20%) improves safety, while excessive pruning (30% or more) can reduce safety. The study further demonstrates that pruning leads to statistically significant performance improvements under domain shifts when applied to linear models. The paper also provides insights into the effects of pruning on attention patterns and perplexity, showing that pruned models are more sensitive to deviations from expected language distributions. The research concludes that WANDA pruning can improve the safety of LLMs by regularizing their behavior, making them more resistant to adversarial attacks. The study also highlights the importance of understanding the effects of pruning on different aspects of LLM performance, including safety, reasoning, and language modeling. The paper proposes that moderate WANDA pruning can be used to enhance the safety of LLMs without compromising their performance. The study also discusses the potential risks of open-sourcing datasets of jailbreaking prompts and malicious tasks, which could be used to generate more dangerous content. Overall, the research provides valuable insights into the effects of pruning on LLM safety and performance, suggesting that moderate pruning can be an effective strategy for improving the safety of large language models.This paper explores how moderate WANDA pruning can enhance the resistance of large language models (LLMs) to jailbreaking attacks without requiring fine-tuning. The study demonstrates that pruning the attention layers of LLMs, particularly with WANDA pruning, increases their resistance to harmful prompts while maintaining performance on standard benchmarks. The research introduces a dataset of 225 harmful tasks across five categories to evaluate the safety improvements from pruning. The findings suggest that the benefits of pruning correlate with the initial safety levels of the model, indicating a regularizing effect of WANDA pruning. The study also shows that pruning helps LLMs focus more effectively on task-relevant tokens within jailbreaking prompts and increases the refusal rate of malicious tasks. The results indicate that moderate pruning (10-20%) improves safety, while excessive pruning (30% or more) can reduce safety. The study further demonstrates that pruning leads to statistically significant performance improvements under domain shifts when applied to linear models. The paper also provides insights into the effects of pruning on attention patterns and perplexity, showing that pruned models are more sensitive to deviations from expected language distributions. The research concludes that WANDA pruning can improve the safety of LLMs by regularizing their behavior, making them more resistant to adversarial attacks. The study also highlights the importance of understanding the effects of pruning on different aspects of LLM performance, including safety, reasoning, and language modeling. The paper proposes that moderate WANDA pruning can be used to enhance the safety of LLMs without compromising their performance. The study also discusses the potential risks of open-sourcing datasets of jailbreaking prompts and malicious tasks, which could be used to generate more dangerous content. Overall, the research provides valuable insights into the effects of pruning on LLM safety and performance, suggesting that moderate pruning can be an effective strategy for improving the safety of large language models.