Bootstrapping Language Models with DPO Implicit Rewards

14 Jun 2024 | Changyu Chen*, Zichen Liu*, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin
This paper introduces DICE (self-alignment with DPO ImpliCit rEwards), an approach that aligns large language models (LLMs) with human preferences using the implicit reward model that Direct Preference Optimization (DPO) yields as a by-product of training. DICE uses this implicit reward to score responses sampled from the current policy and to construct new preference datasets over successive rounds, improving the LLM's alignment without any external feedback. Two techniques keep the bootstrapping stable: length-regularized reward shaping, which counteracts the implicit reward's bias toward longer responses, and experience replay, which mixes the original human preference data back in to avoid over-reliance on the implicit reward model. Empirical results show that DICE significantly enhances LLM alignment, achieving superior performance on AlpacaEval 2.0 compared to Gemini Pro with only 8B parameters, without requiring additional human annotations or external reward models. The approach is practical and efficient, making it a promising method for improving LLMs' alignment with human preferences.
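To make the core quantity concrete, the sketch below computes the DPO implicit reward r(x, y) = β[log π(y|x) − log π_ref(y|x)] with a length penalty −α|y| for reward shaping, then ranks sampled responses into a (chosen, rejected) pair. This is a minimal PyTorch sketch based only on the description above, assuming HuggingFace-style models whose forward pass returns `.logits`; the helper names and the values of `beta` and `alpha` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def response_logprob(model, input_ids, prompt_len):
    """Sum of log-probabilities of the response tokens under `model`.

    `input_ids` is the full prompt+response sequence (1-D tensor);
    `prompt_len` is the number of prompt tokens at the front.
    """
    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0)).logits[0]
    # logits at position t predict the token at position t+1
    logps = F.log_softmax(logits[:-1], dim=-1)
    token_logps = logps.gather(-1, input_ids[1:].unsqueeze(-1)).squeeze(-1)
    # keep only the response tokens (shifted by one for the prediction offset)
    return token_logps[prompt_len - 1:].sum()

def implicit_reward(policy, ref, input_ids, prompt_len, beta=0.1, alpha=0.01):
    """Length-regularized DPO implicit reward:
    r(x, y) = beta * [log pi(y|x) - log pi_ref(y|x)] - alpha * |y|.
    `beta` and `alpha` here are illustrative hyperparameters."""
    diff = (response_logprob(policy, input_ids, prompt_len)
            - response_logprob(ref, input_ids, prompt_len))
    resp_len = input_ids.numel() - prompt_len
    return beta * diff - alpha * resp_len

def build_preference_pair(policy, ref, candidates, prompt_len):
    """Rank sampled responses by implicit reward; the highest- and
    lowest-scoring ones become the (chosen, rejected) pair."""
    scored = sorted(
        candidates,
        key=lambda ids: implicit_reward(policy, ref, ids, prompt_len),
    )
    return scored[-1], scored[0]  # (chosen, rejected)
```

The generated pairs then feed another round of DPO training; experience replay amounts to mixing a fraction of the original human preference data into this generated set before that round.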