Enhancing LLM Safety via Constrained Direct Preference Optimization

2024 | Zixuan Liu, Xiaolin Sun, Zizhan Zheng
This paper introduces Constrained DPO (C-DPO), a method for aligning large language models (LLMs) with both helpfulness and harmlessness. C-DPO extends Direct Preference Optimization (DPO) by adding an explicit safety constraint during fine-tuning and solving the resulting constrained problem with dual gradient descent, without resorting to reinforcement learning. Concretely, for a fixed dual variable the model is fine-tuned to maximize the expected reward under a combined objective that trades helpfulness against harmfulness; the dual variable is then adjusted to enforce the safety constraint, and the two steps alternate. This procedure is more efficient than traditional reinforcement-learning-based alignment and provides a stronger safety guarantee than standard DPO. Evaluated on the BEAVERTAILS dataset, C-DPO outperforms competing methods on both safety and helpfulness, achieving higher rewards while keeping the safety constraint satisfied.
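The outer loop of this scheme can be illustrated with a minimal sketch of dual gradient descent for a reward-maximization problem under an expected-cost constraint. This is not the paper's implementation: the inner "DPO fine-tuning on the combined reward" step is replaced by a toy solver, and all data, names, and hyperparameters below are illustrative assumptions.

```python
# Sketch of the dual-gradient-descent outer loop behind C-DPO (illustrative only).
# Assumption: the inner DPO fine-tuning step is approximated by a toy solver that,
# for each prompt, picks the candidate response maximizing reward - lambda * cost.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each prompt has several candidate responses with a helpfulness
# reward r(x, y) and a harmfulness cost c(x, y).
num_prompts, num_candidates = 200, 5
reward = rng.normal(size=(num_prompts, num_candidates))         # helpfulness r(x, y)
cost = rng.normal(loc=0.5, size=(num_prompts, num_candidates))  # harmfulness c(x, y)

cost_limit = 0.0   # safety constraint: E[c(x, y)] <= cost_limit
lam = 0.0          # dual variable (Lagrange multiplier)
dual_lr = 0.5      # step size for dual ascent


def inner_policy(lam):
    """Stand-in for fine-tuning on the combined reward r - lam * c:
    greedily select the response maximizing r(x, y) - lam * c(x, y)."""
    best = np.argmax(reward - lam * cost, axis=1)
    idx = np.arange(num_prompts)
    return reward[idx, best].mean(), cost[idx, best].mean()


for step in range(50):
    exp_reward, exp_cost = inner_policy(lam)
    # Dual update: increase lambda when the safety constraint is violated,
    # decrease it (never below 0) when there is slack.
    lam = max(0.0, lam + dual_lr * (exp_cost - cost_limit))

exp_reward, exp_cost = inner_policy(lam)
print(f"lambda={lam:.3f}  E[reward]={exp_reward:.3f}  E[cost]={exp_cost:.3f}")
```

In C-DPO itself, the inner step is a full DPO fine-tuning pass on the penalized reward rather than the greedy selection used here, but the alternation between optimizing the penalized objective and updating the multiplier is the same idea.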