Teaching Large Language Models to Reason with Reinforcement Learning


March 8, 2024 | Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu
This paper investigates how well various reinforcement learning (RL) algorithms improve the reasoning capabilities of large language models (LLMs). The study compares Expert Iteration (EI), Proximal Policy Optimization (PPO), and Return-Conditioned RL (RCRL) on reasoning tasks, using both sparse and dense rewards. The experiments cover models of different sizes and initializations, both with and without supervised fine-tuning (SFT) data. The results show that EI generally performs best and, surprisingly, has sample complexity comparable to PPO, converging after only a few thousand samples. The study also finds that RL fine-tuning significantly narrows the gap between pretrained and SFT models, with larger models showing a smaller gap. Additionally, RL fine-tuning improves maj@1 and pass@96 scores simultaneously, unlike SFT, which primarily improves maj@1 accuracy; the authors attribute this to the diversity of the exploration data generated during RL training. The paper concludes by discussing the implications of these findings for RLHF and the future role of RL in LLM fine-tuning, emphasizing the need for more sophisticated exploration techniques to enhance LLM reasoning.
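The summary contrasts maj@1 and pass@96 without defining them. Roughly, maj@k takes a majority vote over k sampled answers per problem, while pass@k counts a problem as solved if any of the k samples is correct (so maj@1 reduces to single-sample accuracy). The short Python sketch below illustrates these metrics under those standard definitions; the function names and toy data are illustrative and not taken from the paper.

from collections import Counter

def maj_at_k(answers: list[str], reference: str, k: int) -> bool:
    # Majority-vote correctness: take the most common answer among the
    # first k sampled answers and compare it to the reference answer.
    majority_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return majority_answer == reference

def pass_at_k(answers: list[str], reference: str, k: int) -> bool:
    # Pass@k correctness: the problem counts as solved if any of the
    # first k sampled answers matches the reference answer.
    return any(a == reference for a in answers[:k])

# Toy data: per-problem final answers extracted from k model samples.
# (Hypothetical values; in practice these come from decoding the model
# k times per question and parsing out the final answer.)
problems = [
    {"answers": ["42", "42", "17", "42"], "reference": "42"},
    {"answers": ["7", "9", "9", "3"], "reference": "7"},
]

k = 4
maj = sum(maj_at_k(p["answers"], p["reference"], k) for p in problems) / len(problems)
pas = sum(pass_at_k(p["answers"], p["reference"], k) for p in problems) / len(problems)
print(f"maj@{k} = {maj:.2f}, pass@{k} = {pas:.2f}")

On the toy data, diverse samples lift pass@k even when the majority vote is wrong, which is the kind of effect the authors report for RL fine-tuning relative to SFT.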