Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack


29 Oct 2024 | Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
This paper proposes Lisa, a method for mitigating harmful fine-tuning attacks on large language models (LLMs). The authors first introduce a Bi-State Optimization (BSO) approach that alternates between optimizing over an alignment dataset and the user's fine-tuning dataset. They find, however, that BSO suffers from convergence instability when too few optimization steps are allocated to the alignment state, which degrades alignment performance. To address this, Lisa augments each state's objective with a proximal term that constrains how far the model can drift within that state. The proximal term is supported by a convergence analysis showing that a sufficiently large proximal factor is necessary for Lisa's convergence.

Empirically, Lisa significantly improves alignment performance while maintaining the LLM's accuracy on user tasks: it outperforms vanilla BSO, reducing harmful scores by up to 6.54% at the same level of fine-tuning accuracy. Evaluated on four downstream fine-tuning tasks, Lisa demonstrates superior performance compared to existing methods and remains effective across different step allocations, proximal intensities, and datasets. It is particularly effective when the alignment state is under-resourced, as it maintains alignment performance despite the imbalance.

The paper also discusses Lisa's limitations, including extra computational overhead and a weak extension to RLHF. The authors conclude that Lisa is a promising defense against harmful fine-tuning attacks on LLMs, but that further research is needed to address these limitations.
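To make the alternating update concrete, below is a minimal PyTorch-style sketch of one Lisa round, not the authors' implementation. It assumes a Hugging Face-style model whose forward pass returns a loss; the loader names, the step budgets, and the choice to anchor the proximal term at the weights saved at the last state switch are illustrative assumptions.

```python
# Minimal sketch of Lisa-style Bi-State Optimization with a proximal term.
# Hypothetical names throughout (model, align_loader, finetune_loader, rho);
# this illustrates the idea, not the paper's released code.
import itertools
import torch

def lisa_round(model, optimizer, align_loader, finetune_loader,
               align_steps=100, finetune_steps=900, rho=1.0, device="cpu"):
    """Run one alignment state followed by one fine-tuning state."""
    for loader, n_steps in ((align_loader, align_steps),
                            (finetune_loader, finetune_steps)):
        # Snapshot the weights at the state switch; the proximal term
        # below penalizes drift away from this anchor within the state.
        anchor = [p.detach().clone() for p in model.parameters()]
        for batch in itertools.islice(iter(loader), n_steps):
            batch = {k: v.to(device) for k, v in batch.items()}
            task_loss = model(**batch).loss  # LM loss on this state's data
            # Proximal penalty: (rho / 2) * ||w - w_anchor||^2
            prox = sum(((p - a) ** 2).sum()
                       for p, a in zip(model.parameters(), anchor))
            loss = task_loss + 0.5 * rho * prox
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The proximal factor rho controls how strongly each state is pulled back toward the switching point; the paper's analysis, which shows a sufficiently large proximal factor is necessary for convergence, matches the intuition that the penalty damps oscillation between the two states.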