Bootstrapping Language Models with DPO Implicit Rewards


14 Jun 2024 | Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin
This paper introduces DICE (self-alignment with DPO Implicit Rewards), an approach that leverages the implicit reward model produced by Direct Preference Optimization (DPO) to further align large language models (LLMs) with human preferences, without any external feedback. The key idea is to use this implicit reward model in a bootstrapping fashion: the implicit rewards of the current DPO-trained LLM score its own generations and are used to construct a new preference dataset, which is then used in a subsequent round of DPO training.

Two refinements make the bootstrapping reliable. Length-regularized reward shaping debiases the implicit reward against response length, countering length exploitation, while experience replay mixes the original preference data into each new round to avoid overreliance on the implicit rewards and to improve the quality of the constructed preference dataset.

DICE significantly improves alignment quality across different base models, achieving 8.02% and 9.35% gains in length-controlled win rate on AlpacaEval 2 with Zephyr-based and Llama3-based models, respectively. The best model reaches a 27.55% length-controlled win rate against GPT-4 Turbo with only 8B parameters, outperforming Gemini Pro without any in-house data or external reward model.

The paper also reviews related work on self-improving fine-tuning and on-policy sampling in preference tuning, and highlights the advantages of DPO implicit rewards for improving LLM alignment. Experiments show that DICE consistently improves a DPO-trained model, outperforms other baselines, and is compatible with other direct preference optimization algorithms across various settings. The authors conclude that DICE is a practical and effective approach for aligning LLMs with human preferences, and discuss broader impacts and future research directions.
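To make the core loop concrete, the sketch below shows how the DPO implicit reward, r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)), could be computed for sampled responses, shaped with a length penalty, and used to select a chosen/rejected pair for the next DPO round. It is a minimal illustration assuming PyTorch and Hugging Face Transformers; the helper names, the β and α values shown, and the simplified boundary tokenization are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of scoring sampled responses with the DPO implicit reward,
# applying length-regularized shaping, and forming a (chosen, rejected) pair.
# Assumes PyTorch + Hugging Face Transformers; hyperparameters and helper
# names are illustrative only.
import torch
import torch.nn.functional as F


@torch.no_grad()
def response_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `prompt` under `model`.
    Tokenization of the prompt/response boundary is simplified for clarity."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]        # logits predicting tokens 1..L-1
    targets = full_ids[:, 1:]
    token_logps = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_start = prompt_ids.shape[1] - 1              # keep only the response tokens
    return token_logps[:, resp_start:].sum().item()


def implicit_reward(policy, ref_model, tokenizer, prompt, response, beta=0.1):
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (
        response_logprob(policy, tokenizer, prompt, response)
        - response_logprob(ref_model, tokenizer, prompt, response)
    )


def build_preference_pair(policy, ref_model, tokenizer, prompt, responses,
                          beta=0.1, alpha=0.01):
    """Score each sampled response with a length-regularized implicit reward
    (reward minus alpha * response length) and return the best/worst as a pair."""
    scores = []
    for y in responses:
        r = implicit_reward(policy, ref_model, tokenizer, prompt, y, beta)
        num_tokens = len(tokenizer(y).input_ids)
        scores.append(r - alpha * num_tokens)
    best = max(range(len(responses)), key=lambda i: scores[i])
    worst = min(range(len(responses)), key=lambda i: scores[i])
    return {"prompt": prompt, "chosen": responses[best], "rejected": responses[worst]}
```

The pair construction here (highest versus lowest length-regularized score among responses sampled from the current policy) is one simple way to realize the bootstrapping idea described above.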
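The experience-replay refinement can be sketched in the same spirit: each new DPO round draws part of its training data from the freshly constructed pairs and part from the original (seed) preference dataset. The mixing function below is a hypothetical illustration; `replay_ratio` and its default are assumed values, not the paper's reported configuration.

```python
# Hypothetical experience-replay mixing: blend freshly constructed preference
# pairs with the original (seed) preference data so that later DPO rounds do
# not rely exclusively on the implicit rewards. `replay_ratio` is an assumed
# hyperparameter, not the paper's reported value.
import random


def mixed_training_set(new_pairs, seed_pairs, replay_ratio=0.5, rng=None):
    """Return a training set with roughly `replay_ratio` of examples drawn
    from the seed data and the remainder from the newly generated pairs."""
    rng = rng or random.Random(0)
    n_total = len(new_pairs)
    n_seed = min(len(seed_pairs), int(n_total * replay_ratio))
    mixed = rng.sample(seed_pairs, n_seed) + rng.sample(new_pairs, n_total - n_seed)
    rng.shuffle(mixed)
    return mixed
```

Each bootstrapping round would then run standard DPO on this mixed set and repeat, with the updated policy serving as the source of implicit rewards for the next round.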