17 Dec 2024 | Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Jiahe Guo, Xingyu Sui, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
This paper proposes SAFEPATCHING, a post safety alignment (PSA) method for large language models (LLMs) that jointly targets three goals: safety enhancement, over-safety mitigation, and utility preservation. Current safety-aligned LLMs suffer from fragile and imbalanced safety mechanisms, which manifest as unsafe responses, over-refusal of benign requests, and degraded general utility. SAFEPATCHING derives two distinct safety patches from the same harmful data, one to enhance safety and one to mitigate over-safety, and then integrates them into the target LLM backbone without compromising its utility. Extensive experiments on four representative LLMs (LLaMA-2/3, Gemma, and Mistral) show that SAFEPATCHING achieves more comprehensive PSA than baseline methods and strikes a better balance between helpfulness and harmlessness; it also performs well in continual PSA scenarios.

The method has two stages: Patch Derivation (PD) and Controllable Patching (CP). In PD, gradient ascent and gradient descent on the harmful data yield the safety patch and the over-safety patch, respectively. In CP, random parameter retention and controlled merging are applied to minimize conflicts between the patches and the backbone.
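The summary stays high-level, so here is a minimal PyTorch sketch of what Patch Derivation could look like, assuming (as the description suggests) that a patch is a task-vector-style parameter delta obtained by briefly fine-tuning a copy of the backbone on harmful data: gradient ascent (negated loss) for the safety patch, plain descent for the over-safety patch. The function name `derive_patch`, the hyperparameters, and the HuggingFace-style forward call returning `.loss` are illustrative assumptions, not details taken from the paper.

```python
import copy
import torch

def derive_patch(model, harmful_batches, ascend, lr=1e-5, max_steps=100):
    """Fine-tune a copy of the backbone on harmful data and return the
    parameter delta (the 'patch') relative to the original weights.

    ascend=True  -> gradient ascent (unlearn harmful behavior): safety patch.
    ascend=False -> gradient descent (fit harmful data): over-safety patch.
    All names and hyperparameters here are illustrative, not from the paper.
    """
    tuned = copy.deepcopy(model)
    tuned.train()
    optim = torch.optim.AdamW(tuned.parameters(), lr=lr)
    for step, batch in enumerate(harmful_batches):
        if step >= max_steps:
            break
        # Assumes a HuggingFace-style causal LM batch with input_ids/labels.
        loss = tuned(**batch).loss
        if ascend:
            loss = -loss  # ascent: push probability of harmful text down
        optim.zero_grad()
        loss.backward()
        optim.step()
    # Patch = tuned weights minus backbone weights (task-vector-style delta).
    base = dict(model.named_parameters())
    return {name: (p.detach() - base[name].detach())
            for name, p in tuned.named_parameters()}
```

Under these assumptions, `derive_patch(model, loader, ascend=True)` would produce the safety patch and `ascend=False` the over-safety patch, both anchored to the same backbone.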
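Likewise, a hedged sketch of Controllable Patching, reading "random parameter retention" as DARE-style random masking of each patch and "controlled merging" as a weighted addition of both sparsified patches into the backbone. The paper's actual conflict-resolution rule may differ; `sparsify`, `controllable_patching`, `alpha`, and `beta` are hypothetical names introduced only for this sketch.

```python
import torch

def sparsify(patch, keep_ratio=0.1, rescale=True):
    """Random parameter retention: keep a random fraction of each patch
    tensor's entries and zero the rest, optionally rescaling survivors so
    the expected update magnitude is preserved (a DARE-style assumption)."""
    out = {}
    for name, delta in patch.items():
        mask = (torch.rand_like(delta) < keep_ratio).to(delta.dtype)
        kept = delta * mask
        out[name] = kept / keep_ratio if rescale else kept
    return out

@torch.no_grad()
def controllable_patching(model, safety_patch, oversafety_patch,
                          keep_ratio=0.1, alpha=1.0, beta=1.0):
    """Controlled merging: add both sparsified patches onto the backbone.
    alpha and beta trade off safety enhancement against over-safety
    mitigation; random retention lowers the overlap between the two
    patches and hence their conflicts with the backbone."""
    s_patch = sparsify(safety_patch, keep_ratio)
    o_patch = sparsify(oversafety_patch, keep_ratio)
    for name, param in model.named_parameters():
        param.add_(alpha * s_patch[name] + beta * o_patch[name])
    return model
```

The scaling coefficients make the merge "controllable" in the sense that the helpful/harmless trade-off can be tuned after patch derivation, without re-training either patch.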
The results show that SAFEPATCHING effectively addresses all three goals of PSA, balancing safety enhancement, over-safety mitigation, and utility preservation, and that it remains effective in continual PSA settings; the analysis highlights the importance of safety-specific controls for preventing conflicts during patching. The main contributions are: (i) the first simultaneous study of the three PSA goals for aligned LLMs; (ii) SAFEPATCHING, which seamlessly integrates safety patches into the target LLM backbone; and (iii) evidence of its effectiveness and efficiency in achieving comprehensive PSA. The paper also reviews related work on post safety alignment and model merging, analyzes performance on safety, over-safety, and utility benchmarks (where SAFEPATCHING outperforms baselines, underscoring its scalability), and discusses limitations and ethical considerations, emphasizing the importance of keeping LLMs both safe and usable.