Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

8 Jun 2024 | Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou
This paper proposes Online Fast-Slow Chasing DPO (OFS-DPO) and Cross-domain Online Fast-Slow Chasing DPO (COFS-DPO) for online direct preference optimization. OFS-DPO is designed to enhance the adaptability of large language models (LLMs) to continuously evolving data by simulating intraspecific competition between a fast and a slow module. The method introduces a regularization term that measures and guides the preference-probability gap between the two modules, enabling efficient optimization. COFS-DPO extends OFS-DPO to cross-domain scenarios by linearly combining fast modules from different task domains, allowing the model to retain historical information and achieve continual value alignment. Theoretical analysis shows that OFS-DPO achieves a lower empirical regret bound, supported by more stable gradient optimization and faster convergence. Experimental results demonstrate that OFS-DPO outperforms DPO in in-domain alignment, while COFS-DPO excels in cross-domain continual learning scenarios. The proposed methods offer new insights and practical solutions for online human preference alignment and show potential for broad applicability across domains.
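To make the fast-slow chasing idea concrete, here is a minimal sketch of what such an objective could look like: a standard DPO loss on the fast module plus a regularizer on the preference-probability gap between the fast and slow modules, and a linear blend of fast-module parameters for the cross-domain case. The function names, the hinge-style form of the regularizer, the treatment of the slow module as a non-trainable lagging copy, and the mixing coefficient `alpha` are all assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def implicit_pref_logit(policy_logps_w, policy_logps_l,
                        ref_logps_w, ref_logps_l, beta=0.1):
    """DPO implicit preference logit: beta * (chosen log-ratio minus rejected log-ratio)."""
    return beta * ((policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l))

def ofs_dpo_loss(fast_logps_w, fast_logps_l,
                 slow_logps_w, slow_logps_l,
                 ref_logps_w, ref_logps_l,
                 beta=0.1, lam=1.0):
    """Sketch of an OFS-DPO-style objective (hypothetical regularizer form)."""
    # Standard DPO loss on the fast module.
    fast_logit = implicit_pref_logit(fast_logps_w, fast_logps_l,
                                     ref_logps_w, ref_logps_l, beta)
    dpo_loss = -F.logsigmoid(fast_logit).mean()

    # Preference-probability gap between fast and slow modules;
    # the slow module is treated here as a lagging copy with no gradient.
    with torch.no_grad():
        slow_logit = implicit_pref_logit(slow_logps_w, slow_logps_l,
                                         ref_logps_w, ref_logps_l, beta)
    gap = torch.sigmoid(fast_logit) - torch.sigmoid(slow_logit)

    # Encourage the fast module to stay ahead of the slow module.
    reg = F.relu(-gap).mean()
    return dpo_loss + lam * reg

def cofs_combine(fast_state_a, fast_state_b, alpha=0.5):
    """COFS-DPO-style combination: linearly blend fast-module parameters
    learned on two task domains (alpha is a hypothetical mixing weight)."""
    return {k: alpha * fast_state_a[k] + (1.0 - alpha) * fast_state_b[k]
            for k in fast_state_a}
```

In this reading, the regularizer keeps the fast module "chasing" ahead of its slower counterpart during online updates, while `cofs_combine` illustrates how retaining per-domain fast modules and blending them could preserve historical preference information across domains.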