2024 | Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang
Token-level Direct Preference Optimization (TDPO) is a novel approach to aligning large language models (LLMs) with human preferences by optimizing the policy at the token level. Unlike previous methods such as Direct Preference Optimization (DPO), which regularize divergence at the sentence level, TDPO incorporates a forward KL divergence constraint at each token, improving both alignment and generation diversity. The method uses the Bradley-Terry model with a token-based reward formulation, giving finer control over KL divergence while remaining simple to implement. Experimental results across text tasks, including sentiment generation and single-turn dialogue datasets, show that TDPO outperforms DPO and PPO-based RLHF methods in balancing alignment quality with generation diversity. The code for TDPO is open-sourced at <https://github.com/Vance0124/Token-level-Direct-Preference-Optimization>.
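To make the idea concrete, below is a minimal PyTorch-style sketch of a token-level preference loss of this kind: a DPO-like Bradley-Terry objective over sequence log-ratios, augmented with a per-token forward KL term summed over the response. The function names, tensor shapes, and the `alpha`/`beta` weighting are illustrative assumptions, not the authors' exact code; the precise TDPO losses are in the linked repository.

```python
import torch
import torch.nn.functional as F

def per_token_forward_kl(policy_logits, ref_logits):
    # Forward KL(pi_ref || pi_theta) at each position, summed over the vocabulary.
    # Shapes: (batch, seq_len, vocab) -> (batch, seq_len).
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    return (ref_logprobs.exp() * (ref_logprobs - policy_logprobs)).sum(-1)

def tdpo_style_loss(
    chosen_logps_pi, rejected_logps_pi,    # per-token label log-probs under the policy, (batch, seq_len)
    chosen_logps_ref, rejected_logps_ref,  # same under the frozen reference model
    chosen_kl, rejected_kl,                # per-token forward KL from per_token_forward_kl, (batch, seq_len)
    chosen_mask, rejected_mask,            # 1.0 for response tokens, 0.0 for prompt/padding
    beta=0.1, alpha=0.5,                   # illustrative hyperparameter values
):
    # Sequence-level log-ratios: sum over response tokens of log pi_theta - log pi_ref.
    chosen_ratio = ((chosen_logps_pi - chosen_logps_ref) * chosen_mask).sum(-1)
    rejected_ratio = ((rejected_logps_pi - rejected_logps_ref) * rejected_mask).sum(-1)

    # Sequential forward KL: sum of the per-token KL constraint over the response.
    chosen_seq_kl = (chosen_kl * chosen_mask).sum(-1)
    rejected_seq_kl = (rejected_kl * rejected_mask).sum(-1)

    # Margin combines the DPO-style implicit reward difference with a KL-difference penalty;
    # detaching the chosen-side KL focuses the gradient on the rejected-side divergence.
    margin = beta * (chosen_ratio - rejected_ratio) \
             - alpha * beta * (rejected_seq_kl - chosen_seq_kl.detach())

    # Bradley-Terry negative log-likelihood of preferring the chosen response.
    return -F.logsigmoid(margin).mean()
```

The key design difference from vanilla DPO in this sketch is the extra `alpha * beta * (...)` term: because the forward KL is accumulated token by token, the regularization acts on each generation step rather than only on the whole sequence, which is what the paper credits for the better alignment-diversity trade-off.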