14 May 2024 | Yunhao Tang, Daniel Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng and Will Dabney
This paper investigates the performance gap between online and offline alignment algorithms in the context of reinforcement learning from human feedback (RLHF). The study shows that online algorithms generally achieve better policy performance than offline algorithms, even under the same optimization budget, measured as the KL divergence from the supervised fine-tuned (SFT) policy. This advantage is consistent across multiple open-source datasets and is not explained by factors such as data coverage or data quality. The results suggest that online algorithms are a Pareto improvement over offline algorithms, highlighting the importance of on-policy sampling in AI alignment.
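To make the shared optimization budget concrete, the sketch below estimates the sequence-level KL divergence between a policy and the SFT policy from samples drawn from the policy. It is a minimal sketch, assuming per-token log-probabilities are already available as tensors; the function name and toy numbers are illustrative, not taken from the paper.

```python
import torch

def sequence_kl_estimate(policy_logps, sft_logps, mask):
    """Monte-Carlo estimate of sequence-level KL(pi || pi_SFT).

    policy_logps, sft_logps: [batch, seq_len] per-token log-probs of the
    *sampled* tokens under each model (samples drawn from the policy).
    mask: [batch, seq_len] with 1.0 on response tokens, 0.0 on padding.
    """
    # For y ~ pi, E[log pi(y|x) - log pi_SFT(y|x)] estimates KL(pi || pi_SFT).
    per_sequence = ((policy_logps - sft_logps) * mask).sum(dim=-1)
    return per_sequence.mean()

# Toy usage: two responses of length 4 (last token of the second is padding).
policy_logps = torch.tensor([[-1.0, -0.5, -0.7, -0.2],
                             [-0.9, -0.4, -1.1,  0.0]])
sft_logps    = torch.tensor([[-1.2, -0.9, -0.8, -0.5],
                             [-1.0, -0.6, -1.3,  0.0]])
mask         = torch.tensor([[1., 1., 1., 1.],
                             [1., 1., 1., 0.]])
print(sequence_kl_estimate(policy_logps, sft_logps, mask))  # KL estimate in nats
```

Comparing policies at matched values of this quantity is what allows the "same budget" comparison described above.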
The study also reveals an intriguing interplay between the discriminative and generative capabilities of policies. Offline policies are better at classification (ranking preferred over rejected responses) yet worse at generation, whereas online policies excel at generation but are less effective at classification. This discrepancy is attributed to the different sampling processes used in online and offline algorithms. The performance gap also persists for both contrastive and non-contrastive loss functions, and scaling up policy networks does not resolve it.
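One way to make the discriminative side of this comparison concrete is to score a policy as a pairwise preference classifier via its implicit reward, i.e. the log-probability ratio against the SFT policy, as in DPO-style analyses. The sketch below is a hedged illustration assuming summed response log-probabilities as inputs; `beta` and the toy values are placeholders, not the paper's settings.

```python
import torch

def implicit_reward(policy_logp_sum, sft_logp_sum, beta=0.1):
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_SFT(y|x)).

    Inputs are summed log-probs of a full response under each model.
    beta is an illustrative coefficient, not a value from the paper.
    """
    return beta * (policy_logp_sum - sft_logp_sum)

def preference_accuracy(chosen_policy, chosen_sft, rejected_policy, rejected_sft):
    """Fraction of pairs where the policy's implicit reward ranks the
    human-chosen response above the rejected one."""
    r_chosen = implicit_reward(chosen_policy, chosen_sft)
    r_rejected = implicit_reward(rejected_policy, rejected_sft)
    return (r_chosen > r_rejected).float().mean()

# Toy usage on three preference pairs (summed response log-probs).
chosen_policy   = torch.tensor([-12.0, -8.5, -10.0])
chosen_sft      = torch.tensor([-13.0, -9.0,  -9.5])
rejected_policy = torch.tensor([-11.0, -9.5, -12.0])
rejected_sft    = torch.tensor([-11.5, -9.0, -11.0])
print(preference_accuracy(chosen_policy, chosen_sft, rejected_policy, rejected_sft))
```

Generative quality, by contrast, is judged on samples drawn from the policy itself, which is why a policy can score well on one axis and poorly on the other.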
The paper examines several hypotheses for the performance gap, including limited data coverage, sub-optimal offline datasets, better classification translating into better performance, and the choice of loss function. The empirical evidence refutes each of them: data coverage alone cannot explain the gap, and offline policies trained on high-quality data still do not match the performance of online policies.
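For reference, the two families of offline objectives discussed above can be sketched as a contrastive, DPO-style loss that widens the log-ratio margin between chosen and rejected responses, and a non-contrastive baseline that simply imitates the preferred response. This is a schematic sketch under the assumption that response log-probabilities are precomputed; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref, beta=0.1):
    """Contrastive (DPO-style) loss on summed response log-probs: push the
    policy's log-ratio margin between chosen and rejected responses positive.
    beta is an illustrative coefficient."""
    margin = beta * ((chosen_policy - chosen_ref) - (rejected_policy - rejected_ref))
    return -F.logsigmoid(margin).mean()

def preferred_sft_loss(chosen_policy_logps, mask):
    """Non-contrastive baseline: plain negative log-likelihood on the
    human-preferred responses only."""
    return -(chosen_policy_logps * mask).sum() / mask.sum()

# Toy usage with summed log-probs for the contrastive loss ...
chosen_policy   = torch.tensor([-12.0, -8.5])
chosen_ref      = torch.tensor([-13.0, -9.0])
rejected_policy = torch.tensor([-11.0, -9.5])
rejected_ref    = torch.tensor([-11.5, -9.0])
print(contrastive_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref))

# ... and per-token log-probs plus a padding mask for the non-contrastive one.
chosen_policy_logps = torch.tensor([[-1.0, -0.5, -0.7], [-0.9, -0.4, 0.0]])
mask                = torch.tensor([[1., 1., 1.], [1., 1., 0.]])
print(preferred_sft_loss(chosen_policy_logps, mask))
```

The paper's finding is that the online-versus-offline gap appears regardless of which of these two styles of objective is used offline.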
Overall, the study highlights the importance of on-policy sampling and suggests that offline alignment algorithms face fundamental challenges. Online algorithms reach higher performance at a given KL budget, while offline algorithms need further improvements to close the gap; the findings motivate more research into the root causes of these challenges and into the benefits of on-policy sampling for AI alignment.