Token-level Direct Preference Optimization


2024 | Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang
This paper introduces Token-level Direct Preference Optimization (TDPO), a novel approach to aligning large language models (LLMs) with human preferences by optimizing the policy at the token level. Unlike previous methods, TDPO incorporates a forward KL divergence constraint for each token, improving both alignment and generation diversity. It employs the Bradley-Terry model for a token-based reward system, which enhances the regulation of KL divergence while preserving the simplicity of DPO: the policy is optimized directly, without explicit reward model learning or policy sampling during training. Experimental results across multiple text-generation tasks show that TDPO strikes a better balance between alignment and generation diversity than DPO on controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. TDPO not only effectively addresses the issue of excessive KL divergence but also greatly improves divergence efficiency. The code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
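To make the token-level objective concrete, below is a minimal PyTorch sketch of a TDPO-style loss: a DPO-like Bradley-Terry preference term computed from per-token log-ratios, plus a sequential (token-level) KL regularizer. The function name, tensor layout, and the alpha-weighted KL term are illustrative assumptions, not the authors' exact implementation; the precise TDPO1/TDPO2 objectives are in the open-sourced repository linked above.

```python
import torch
import torch.nn.functional as F

def tdpo_style_loss(policy_chosen_logps,   # (B, Tw) per-token log-probs of the chosen response under the policy
                    policy_rejected_logps, # (B, Tl) per-token log-probs of the rejected response under the policy
                    ref_chosen_logps,      # (B, Tw) same chosen tokens scored by the frozen reference model
                    ref_rejected_logps,    # (B, Tl) same rejected tokens scored by the reference model
                    chosen_seq_kl,         # (B,) sequential (per-token, summed) KL(ref || policy) on chosen responses
                    rejected_seq_kl,       # (B,) sequential KL on rejected responses
                    beta=0.1, alpha=0.5):
    # Sequence-level implicit rewards, as in DPO: sum of per-token log-ratios.
    chosen_logratio = (policy_chosen_logps - ref_chosen_logps).sum(-1)
    rejected_logratio = (policy_rejected_logps - ref_rejected_logps).sum(-1)

    # Preference margin between chosen and rejected responses.
    margin = chosen_logratio - rejected_logratio

    # Token-level KL regularization: penalize the gap in sequential KL divergence
    # between rejected and chosen responses, weighted by alpha (an assumption here;
    # the paper's TDPO1/TDPO2 variants differ in how this term is weighted/detached).
    kl_term = alpha * (rejected_seq_kl - chosen_seq_kl)

    # Bradley-Terry style negative log-likelihood on the regularized margin.
    return -F.logsigmoid(beta * (margin - kl_term)).mean()

if __name__ == "__main__":
    # Dummy tensors, only to show the expected shapes.
    B, Tw, Tl = 4, 16, 20
    loss = tdpo_style_loss(torch.randn(B, Tw), torch.randn(B, Tl),
                           torch.randn(B, Tw), torch.randn(B, Tl),
                           torch.rand(B), torch.rand(B))
    print(loss.item())
```

The design point this sketch illustrates is the only change relative to DPO: the preference margin is shifted by a per-token (sequential) KL difference, which is what lets TDPO regulate KL divergence without an explicit reward model or on-policy sampling.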