The Importance of Online Data: Understanding Preference Fine-tuning via Coverage


2024 | Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun
The paper examines the importance of online data in preference fine-tuning of large language models (LLMs), focusing on the theoretical gap between online reinforcement learning (RL) methods and offline contrastive methods. It introduces Hybrid Preference Optimization (HyPO), an algorithm that combines offline preference data for contrastive-based optimization with online unlabeled data for KL regularization.

The analysis shows that online RL methods can outperform offline methods when the offline preference data is not diverse enough: a global coverage condition is necessary for offline contrastive methods to converge to the optimal policy, whereas a weaker partial coverage condition suffices for online RL methods.

Empirically, HyPO outperforms its purely offline counterpart DPO while retaining DPO's computational and memory efficiency. Theoretical and empirical results show that HyPO is more effective at controlling the reverse KL divergence to the reference policy and at improving performance on tasks such as summarization and general chat benchmarks.

The paper also discusses the role of function approximation in the practical success of offline contrastive methods and provides a theoretical explanation for the extrapolation behavior of preference fine-tuning algorithms under the global coverage assumption.
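To make the hybrid objective concrete, the sketch below combines an offline DPO-style contrastive loss with an online reverse-KL penalty estimated from unlabeled generations of the current policy. This is a minimal PyTorch-style illustration, not the authors' implementation: the function name `hypo_loss`, the coefficients `beta` and `lam`, and the assumption that per-sequence log-probabilities are already computed are all illustrative choices.

```python
# Minimal sketch of a HyPO-style objective (illustrative, not the paper's code).
# Assumes per-sequence log-probabilities under the policy and the frozen
# reference model have already been computed for each batch.
import torch
import torch.nn.functional as F

def hypo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x) on offline preference pairs
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x) on offline preference pairs
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    policy_online_logps: torch.Tensor,    # log pi_theta(y | x) for y sampled from pi_theta (online, unlabeled)
    ref_online_logps: torch.Tensor,       # log pi_ref(y | x) for the same online samples
    beta: float = 0.1,                    # illustrative DPO temperature
    lam: float = 0.05,                    # illustrative weight on the online KL term
) -> torch.Tensor:
    # Offline part: standard DPO contrastive loss on labeled preference pairs.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # Online part: Monte-Carlo estimate of the reverse KL(pi_theta || pi_ref)
    # using unlabeled generations drawn from the current policy.
    kl_estimate = (policy_online_logps - ref_online_logps).mean()

    return dpo_loss + lam * kl_estimate
```

The split mirrors the paper's high-level recipe: the offline term only ever sees the fixed preference dataset, while the online term regularizes the policy on its own samples, which is the part a purely offline method cannot control directly.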