4 Mar 2024 | Zixuan Liu, Xiaolin Sun, Zizhan Zheng
The paper introduces Constrained DPO (C-DPO), a novel approach to enhance the safety and helpfulness of large language models (LLMs) by optimizing a dual objective that balances helpfulness and harmlessness. Unlike traditional reinforcement learning from human feedback (RLHF), which is computationally expensive and unstable, C-DPO combines dual gradient descent with Direct Preference Optimization (DPO) to achieve efficient and lightweight fine-tuning. This combination identifies a nearly optimal trade-off between helpfulness and harmlessness without resorting to reinforcement learning. Empirical results show that C-DPO provides a strong safety guarantee and achieves significantly higher rewards than other methods, including a recently proposed safe RLHF approach. The method is evaluated on the LLaMA-2-7B model, demonstrating its effectiveness in generating responses that are both high-quality and safe.
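To make the dual gradient descent idea concrete, here is a minimal, self-contained sketch on a toy scalar problem. The "policy" is a single parameter `theta`, and `reward()` and `cost()` stand in for the helpfulness reward and harmfulness cost that C-DPO handles via DPO fine-tuning; all function names and constants below are illustrative assumptions, not the authors' code.

```python
# Toy illustration of dual gradient descent for a constrained objective:
#   maximize reward(theta)  subject to  cost(theta) <= cost_limit.
# In C-DPO the inner maximization is performed by DPO fine-tuning against a
# combined reward; here it is solved in closed form on a toy problem.

def reward(theta: float) -> float:
    # Toy helpfulness reward: concave, unconstrained maximum at theta = 1.
    return 2.0 * theta - theta ** 2

def cost(theta: float) -> float:
    # Toy harmfulness cost: grows with theta.
    return theta

def best_response(lam: float) -> float:
    # Inner step: maximize the Lagrangian reward(theta) - lam * cost(theta).
    # d/dtheta [2*theta - theta**2 - lam*theta] = 0  =>  theta = (2 - lam) / 2.
    return (2.0 - lam) / 2.0

def dual_gradient_descent(cost_limit: float = 0.5,
                          lam: float = 0.0,
                          lr: float = 0.5,
                          steps: int = 50) -> tuple[float, float]:
    # Outer step: projected gradient ascent on the dual variable lam.
    # lam increases while the constraint cost(theta) <= cost_limit is violated
    # and is clipped at zero, steering the inner step toward safer solutions.
    for _ in range(steps):
        theta = best_response(lam)
        violation = cost(theta) - cost_limit
        lam = max(0.0, lam + lr * violation)
    return best_response(lam), lam

if __name__ == "__main__":
    theta, lam = dual_gradient_descent()
    print(f"theta={theta:.3f}  lambda={lam:.3f}  "
          f"reward={reward(theta):.3f}  cost={cost(theta):.3f}")
```

On this toy problem the loop converges to lambda = 1 and theta = 0.5, i.e. the largest reward achievable while keeping the cost at its limit, which mirrors how the dual variable in C-DPO is meant to settle at a value that trades helpfulness against the harmlessness constraint.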