The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

16 Jul 2024 | Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun
The paper explores the differences and similarities between online reinforcement learning (RL) and offline contrastive methods for fine-tuning large language models (LLMs) on human preference data. The authors introduce the concept of *dataset coverage* to analyze the performance of these methods. They prove that offline contrastive methods require a global coverage condition to converge to the optimal policy, while online RL methods need only a weaker partial coverage condition. This separation explains why online RL methods often outperform offline methods, especially when the offline preference data covers the response space insufficiently.

To address this limitation of offline contrastive methods, the authors propose a hybrid preference optimization (HyPO) algorithm that uses offline preference data for contrastive optimization and online, unlabeled data for KL regularization. Theoretical and empirical results show that HyPO outperforms pure offline methods like DPO while retaining much of their computational efficiency. The paper also discusses the role of function approximation in the success of offline contrastive methods and gives a theoretical explanation for the extrapolation behavior of these algorithms under global coverage assumptions.
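To make the hybrid objective concrete, below is a minimal sketch of how a HyPO-style loss might combine a DPO-style contrastive term on offline preference pairs with a KL-regularization term estimated from online, unlabeled generations. The function name `hypo_loss`, the hyperparameters `beta` and `lam`, and the tensor layout are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def hypo_loss(policy_logp_chosen, policy_logp_rejected,
              ref_logp_chosen, ref_logp_rejected,
              policy_logp_online, ref_logp_online,
              beta=0.1, lam=0.1):
    """Hypothetical sketch of a HyPO-style objective.

    Offline term: DPO contrastive loss on labeled preference pairs.
    Online term: KL(pi || pi_ref) estimated from unlabeled on-policy samples.
    Each *_logp argument is a 1-D tensor of summed per-sequence token
    log-probabilities under the policy or the frozen reference model.
    """
    # DPO contrastive loss on the offline preference data
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    dpo_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # Simple Monte Carlo estimate of KL(pi || pi_ref) from online generations
    # sampled from the current policy (practical code may use other estimators)
    kl_estimate = (policy_logp_online - ref_logp_online).mean()

    # Penalizing the KL term keeps the policy close to the reference model
    # even where the offline data provides no coverage
    return dpo_loss + lam * kl_estimate
```

In this sketch the offline term is exactly the standard DPO loss, so the extra cost over pure DPO is generating the online samples and scoring them under the policy and the reference model; no reward model or preference labels are needed for the online data.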