17 Dec 2024 | Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Jiahe Guo, Xingyu Sui, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
This paper proposes SAFEPATCHING, a post safety alignment (PSA) method for large language models (LLMs) that jointly targets three goals: safety enhancement, over-safety mitigation, and utility preservation. Current safety-aligned LLMs suffer from fragile and imbalanced safety mechanisms, which manifest as unsafe responses, over-refusal of benign requests, and degraded general utility. SAFEPATCHING derives two distinct safety patches from the same harmful data, one to enhance safety and one to mitigate over-safety, and then integrates them into the target LLM backbone without compromising its utility. Extensive experiments on four representative LLMs (LLaMA-2/3, Gemma, and Mistral) show that SAFEPATCHING achieves more comprehensive PSA than baseline methods and strikes a better balance between helpfulness and harmlessness; it also performs well in continual PSA scenarios.

The method has two stages: Patch Derivation (PD) and Controllable Patching (CP). In PD, gradient ascent and gradient descent on the harmful data yield the safety patch and the over-safety patch, respectively. In CP, random parameter retention and controlled merging are applied to minimize conflicts between the patches and the backbone.
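The summary stays high-level, so here is a minimal PyTorch sketch of what Patch Derivation could look like, assuming (as the description suggests) that a patch is a task-vector-style parameter delta obtained by briefly fine-tuning a copy of the backbone on harmful data: gradient ascent (negated loss) for the safety patch, plain descent for the over-safety patch. The function name `derive_patch`, the hyperparameters, and the HuggingFace-style forward call returning `.loss` are illustrative assumptions, not details taken from the paper.

```python
import copy
import torch

def derive_patch(model, harmful_batches, ascend, lr=1e-5, max_steps=100):
    """Fine-tune a copy of the backbone on harmful data and return the
    parameter delta (the 'patch') relative to the original weights.

    ascend=True  -> gradient ascent (unlearn harmful behavior): safety patch.
    ascend=False -> gradient descent (fit harmful data): over-safety patch.
    All names and hyperparameters here are illustrative, not from the paper.
    """
    tuned = copy.deepcopy(model)
    tuned.train()
    optim = torch.optim.AdamW(tuned.parameters(), lr=lr)
    for step, batch in enumerate(harmful_batches):
        if step >= max_steps:
            break
        # Assumes a HuggingFace-style causal LM batch with input_ids/labels.
        loss = tuned(**batch).loss
        if ascend:
            loss = -loss  # ascent: push probability of harmful text down
        optim.zero_grad()
        loss.backward()
        optim.step()
    # Patch = tuned weights minus backbone weights (task-vector-style delta).
    base = dict(model.named_parameters())
    return {name: (p.detach() - base[name].detach())
            for name, p in tuned.named_parameters()}
```

Under these assumptions, `derive_patch(model, loader, ascend=True)` would produce the safety patch and `ascend=False` the over-safety patch, both anchored to the same backbone.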
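Likewise, a hedged sketch of Controllable Patching, reading "random parameter retention" as DARE-style random masking of each patch and "controlled merging" as a weighted addition of both sparsified patches into the backbone. The paper's actual conflict-resolution rule may differ; `sparsify`, `controllable_patching`, `alpha`, and `beta` are hypothetical names introduced only for this sketch.

```python
import torch

def sparsify(patch, keep_ratio=0.1, rescale=True):
    """Random parameter retention: keep a random fraction of each patch
    tensor's entries and zero the rest, optionally rescaling survivors so
    the expected update magnitude is preserved (a DARE-style assumption)."""
    out = {}
    for name, delta in patch.items():
        mask = (torch.rand_like(delta) < keep_ratio).to(delta.dtype)
        kept = delta * mask
        out[name] = kept / keep_ratio if rescale else kept
    return out

@torch.no_grad()
def controllable_patching(model, safety_patch, oversafety_patch,
                          keep_ratio=0.1, alpha=1.0, beta=1.0):
    """Controlled merging: add both sparsified patches onto the backbone.
    alpha and beta trade off safety enhancement against over-safety
    mitigation; random retention lowers the overlap between the two
    patches and hence their conflicts with the backbone."""
    s_patch = sparsify(safety_patch, keep_ratio)
    o_patch = sparsify(oversafety_patch, keep_ratio)
    for name, param in model.named_parameters():
        param.add_(alpha * s_patch[name] + beta * o_patch[name])
    return model
```

The scaling coefficients make the merge "controllable" in the sense that the helpful/harmless trade-off can be tuned after patch derivation, without re-training either patch.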
The results show that SAFEPATCHING effectively addresses all three goals of PSA, balancing safety enhancement, over-safety mitigation, and utility preservation, and that it remains effective in continual PSA settings; the analysis highlights the importance of safety-specific controls for preventing conflicts during patching. The main contributions are: (i) the first simultaneous study of the three PSA goals for aligned LLMs; (ii) SAFEPATCHING, which seamlessly integrates safety patches into the target LLM backbone; and (iii) evidence of its effectiveness and efficiency in achieving comprehensive PSA. The paper also reviews related work on post safety alignment and model merging, analyzes performance on safety, over-safety, and utility benchmarks (where SAFEPATCHING outperforms baselines, underscoring its scalability), and discusses limitations and ethical considerations, emphasizing the importance of keeping LLMs both safe and usable.