2 Jun 2024 | Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar
Preference fine-tuning of large language models (LLMs) should leverage suboptimal, on-policy data. This paper compares preference-based fine-tuning approaches, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning, and finds that methods using on-policy sampling or negative gradients outperform offline, maximum-likelihood objectives. Such methods are "mode-seeking": they can rapidly reallocate probability mass across the modes of a categorical distribution, which is especially beneficial when high-reward responses lie in low-probability regions of the reference policy. The paper also analyzes the role of data coverage and of the geometric relationship between the reward function and the reference policy, and offers actionable guidance for practitioners on the trade-off between on-policy sampling and the number of gradient steps, as well as on how negative gradients improve performance. The findings are validated on didactic bandit problems, synthetic LLM problems, and full-scale LLM fine-tuning, showing that on-policy sampling and negative gradients matter most when the reward function's peak lies far from the reference policy.
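The mode-seeking behavior can be illustrated with a toy categorical bandit, in the spirit of the paper's didactic experiments (this is a hedged sketch, not the authors' exact setup). A 3-armed softmax policy is fine-tuned with the exact policy gradient of the KL-regularized objective E_pi[r] - beta * KL(pi || pi_ref), a reverse-KL (mode-seeking) objective; the reward peak sits on an arm the reference policy rarely samples, and the update still moves nearly all probability mass onto it. The arm count, reward values, beta, and learning rate are illustrative choices.

```python
import numpy as np

r = np.array([0.0, 0.0, 5.0])          # reward peak on arm 2
pi_ref = np.array([0.45, 0.45, 0.10])  # reference puts little mass on that arm
beta, lr, steps = 0.1, 0.5, 500        # illustrative hyperparameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.log(pi_ref)                     # initialize the policy at the reference
probs_init = softmax(z)

for _ in range(steps):
    pi = softmax(z)
    # per-arm "advantage": reward minus the reverse-KL penalty term
    g = r - beta * (np.log(pi) - np.log(pi_ref))
    # exact gradient of E_pi[r] - beta*KL(pi || pi_ref) w.r.t. softmax logits:
    # grad_z = pi * (g - E_pi[g])
    z = z + lr * pi * (g - pi @ g)

probs_final = softmax(z)
print("before:", probs_init.round(3), "after:", probs_final.round(3))
```

The mode-seeking update concentrates the policy on the high-reward arm even though it starts with only 10% probability under the reference; a mode-covering (forward-KL / maximum-likelihood) objective would instead spread mass over everything the data covers, which is the contrast the paper draws.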