The paper introduces Online Fast-Slow Chasing Direct Preference Optimization (OFS-DPO) to improve the alignment of large language models (LLMs) with human values by training directly on human preference datasets. Inspired by intraspecific competition, OFS-DPO simulates competition between a fast and a slow module to facilitate rapid adaptation, with a regularization term guiding the learning of the two modules. To extend OFS-DPO to cross-domain scenarios, the authors propose Cross-domain Online Fast-Slow Chasing DPO (COFS-DPO), which linearly combines the fast modules learned on different task domains to achieve better cross-domain performance. Theoretical analysis and experimental results show that OFS-DPO outperforms standard DPO on in-domain tasks, while COFS-DPO excels in cross-domain continual learning scenarios. The paper's contributions are the introduction of OFS-DPO and COFS-DPO, together with their theoretical guarantees and experimental validation.
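The sketch below illustrates the fast-slow chasing idea in a hedged, simplified form. It assumes the fast and slow modules are small adapter networks trained with the standard DPO objective, that the slow module lags simply by using a lower learning rate, and that the "chasing" regularizer is a hinge on the gap between the two modules' DPO losses; it also shows the linear combination of fast modules used in the cross-domain setting. All names (Adapter, training_step, chase_weight, merge_fast_modules) are illustrative, not the paper's implementation.

```python
# Minimal sketch of fast-slow chasing DPO under the assumptions stated above.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def dpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (chosen_logps - rejected_logps) - (ref_chosen_logps - ref_rejected_logps)
    return -F.logsigmoid(beta * margin).mean()

class Adapter(nn.Module):
    """Toy stand-in for a LoRA-style module mapping features to a per-example log-prob score."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, x):
        return self.proj(x).squeeze(-1)

def training_step(fast, slow, opt_fast, opt_slow, batch, beta=0.1, chase_weight=1.0):
    """One fast-slow update: the fast module chases the slow module via a hinge regularizer."""
    x_chosen, x_rejected, ref_c, ref_r = batch
    loss_fast = dpo_loss(fast(x_chosen), fast(x_rejected), ref_c, ref_r, beta)
    loss_slow = dpo_loss(slow(x_chosen), slow(x_rejected), ref_c, ref_r, beta)
    # Regularizer: penalize the fast module whenever it falls behind the (detached) slow module.
    reg = F.relu(loss_fast - loss_slow.detach())
    (loss_fast + chase_weight * reg).backward()
    opt_fast.step(); opt_fast.zero_grad()
    # The slow module is trained on plain DPO, but with a smaller learning rate so it lags.
    loss_slow.backward()
    opt_slow.step(); opt_slow.zero_grad()

def merge_fast_modules(fast_a, fast_b, alpha=0.5):
    """Cross-domain combination: linearly interpolate the weights of two fast modules."""
    merged = copy.deepcopy(fast_a)
    with torch.no_grad():
        for p_m, p_a, p_b in zip(merged.parameters(), fast_a.parameters(), fast_b.parameters()):
            p_m.copy_(alpha * p_a + (1 - alpha) * p_b)
    return merged

if __name__ == "__main__":
    dim = 16
    fast, slow = Adapter(dim), Adapter(dim)
    opt_fast = torch.optim.AdamW(fast.parameters(), lr=1e-3)
    opt_slow = torch.optim.AdamW(slow.parameters(), lr=1e-4)  # slower, so it trails the fast module
    # Dummy batch: chosen/rejected features plus reference-model log-probs.
    batch = (torch.randn(8, dim), torch.randn(8, dim), torch.zeros(8), torch.zeros(8))
    training_step(fast, slow, opt_fast, opt_slow, batch)
```

In this reading, the slow module serves as a moving baseline: because its loss is detached, gradients flow only into the fast module, which is pushed to stay ahead of its competitor, while the interpolation coefficient alpha in merge_fast_modules would be tuned per cross-domain task pair.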