Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack


29 Oct 2024 | Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
This paper proposes Lisa, a method for mitigating harmful fine-tuning attacks on large language models (LLMs). The authors first introduce a Bi-State Optimization (BSO) approach that alternates between optimizing over an alignment dataset and the user's fine-tuning dataset. They find, however, that BSO suffers from convergence instability when too few optimization steps are allocated to the alignment state, which degrades alignment performance. To address this, Lisa augments each state's objective with a proximal term that constrains how far the model can drift within that state. The proximal term is supported by a convergence analysis showing that a sufficiently large proximal factor is necessary for Lisa's convergence.

Empirically, Lisa significantly improves alignment performance while maintaining the LLM's accuracy on user tasks: it outperforms vanilla BSO, reducing harmful scores by up to 6.54% at the same level of fine-tuning accuracy. Evaluated on four downstream fine-tuning tasks, Lisa demonstrates superior performance compared to existing methods and remains effective across different step allocations, proximal intensities, and datasets. It is particularly effective when the alignment state is under-resourced, as it maintains alignment performance despite the imbalance.

The paper also discusses Lisa's limitations, including extra computational overhead and a weak extension to RLHF. The authors conclude that Lisa is a promising defense against harmful fine-tuning attacks on LLMs, but that further research is needed to address these limitations.
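To make the alternating update concrete, below is a minimal PyTorch-style sketch of one Lisa round, not the authors' implementation. It assumes a Hugging Face-style model whose forward pass returns a loss; the loader names, the step budgets, and the choice to anchor the proximal term at the weights saved at the last state switch are illustrative assumptions.

```python
# Minimal sketch of Lisa-style Bi-State Optimization with a proximal term.
# Hypothetical names throughout (model, align_loader, finetune_loader, rho);
# this illustrates the idea, not the paper's released code.
import itertools
import torch

def lisa_round(model, optimizer, align_loader, finetune_loader,
               align_steps=100, finetune_steps=900, rho=1.0, device="cpu"):
    """Run one alignment state followed by one fine-tuning state."""
    for loader, n_steps in ((align_loader, align_steps),
                            (finetune_loader, finetune_steps)):
        # Snapshot the weights at the state switch; the proximal term
        # below penalizes drift away from this anchor within the state.
        anchor = [p.detach().clone() for p in model.parameters()]
        for batch in itertools.islice(iter(loader), n_steps):
            batch = {k: v.to(device) for k, v in batch.items()}
            task_loss = model(**batch).loss  # LM loss on this state's data
            # Proximal penalty: (rho / 2) * ||w - w_anchor||^2
            prox = sum(((p - a) ** 2).sum()
                       for p, a in zip(model.parameters(), anchor))
            loss = task_loss + 0.5 * rho * prox
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The proximal factor rho controls how strongly each state is pulled back toward the switching point; the paper's analysis, which shows a sufficiently large proximal factor is necessary for convergence, matches the intuition that the penalty damps oscillation between the two states.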