2 Jun 2024 | Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar
Preference fine-tuning of large language models (LLMs) should leverage suboptimal, on-policy data. This paper compares preference-based fine-tuning approaches, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning, and finds that methods using on-policy sampling or negative gradients outperform offline, maximum-likelihood objectives. Such methods are "mode-seeking": they can rapidly reallocate probability mass across the modes of a categorical distribution, which is especially beneficial when high-reward responses lie in low-probability regions of the reference policy. The paper also analyzes the role of data coverage and of the geometric relationship between the reward function and the reference policy, and offers actionable guidance for practitioners on the trade-off between on-policy sampling and the number of gradient steps, as well as on how negative gradients improve performance. The findings are validated on didactic bandit problems, synthetic LLM problems, and full-scale LLM fine-tuning, showing that on-policy sampling and negative gradients matter most when the reward function's peak lies far from the reference policy.
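The mode-seeking behavior can be illustrated with a toy categorical bandit, in the spirit of the paper's didactic experiments (this is a hedged sketch, not the authors' exact setup). A 3-armed softmax policy is fine-tuned with the exact policy gradient of the KL-regularized objective E_pi[r] - beta * KL(pi || pi_ref), a reverse-KL (mode-seeking) objective; the reward peak sits on an arm the reference policy rarely samples, and the update still moves nearly all probability mass onto it. The arm count, reward values, beta, and learning rate are illustrative choices.

```python
import numpy as np

r = np.array([0.0, 0.0, 5.0])          # reward peak on arm 2
pi_ref = np.array([0.45, 0.45, 0.10])  # reference puts little mass on that arm
beta, lr, steps = 0.1, 0.5, 500        # illustrative hyperparameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.log(pi_ref)                     # initialize the policy at the reference
probs_init = softmax(z)

for _ in range(steps):
    pi = softmax(z)
    # per-arm "advantage": reward minus the reverse-KL penalty term
    g = r - beta * (np.log(pi) - np.log(pi_ref))
    # exact gradient of E_pi[r] - beta*KL(pi || pi_ref) w.r.t. softmax logits:
    # grad_z = pi * (g - E_pi[g])
    z = z + lr * pi * (g - pi @ g)

probs_final = softmax(z)
print("before:", probs_init.round(3), "after:", probs_final.round(3))
```

The mode-seeking update concentrates the policy on the high-reward arm even though it starts with only 10% probability under the reference; a mode-covering (forward-KL / maximum-likelihood) objective would instead spread mass over everything the data covers, which is the contrast the paper draws.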