March 8, 2024 | Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu
This paper investigates how effectively several reinforcement learning (RL) algorithms improve the reasoning capabilities of large language models (LLMs). The study compares Expert Iteration (EI), Proximal Policy Optimization (PPO), and Return-Conditioned RL (RCRL) on reasoning tasks, using both sparse and dense rewards. EI performs best across all metrics, with sample efficiency comparable to PPO. RL training significantly improves both maj@1 and pass@96 scores, whereas supervised fine-tuning (SFT) shows a trade-off between the two metrics. The study also highlights the importance of exploration in RL training and the limitations of SFT in achieving diverse solutions. Overall, the results indicate that RL can be a valuable tool for improving LLM reasoning, with EI showing particularly strong performance and efficiency.
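The maj@1 and pass@96 metrics mentioned above can be illustrated with a minimal sketch. It assumes the common reading of these names: maj@k checks whether the most frequent answer among k sampled solutions matches the reference, and pass@k checks whether any of the k samples matches. The function names and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from collections import Counter


def maj_at_k(samples, reference):
    """Return True if the most frequent sampled answer equals the reference.

    With a single sample (k = 1) this reduces to plain accuracy.
    Ties are broken in favor of the answer seen first.
    """
    most_common_answer, _count = Counter(samples).most_common(1)[0]
    return most_common_answer == reference


def pass_at_k(samples, reference):
    """Return True if at least one sampled answer equals the reference."""
    return any(answer == reference for answer in samples)


def evaluate(problems):
    """Average both metrics over (samples, reference) pairs."""
    n = len(problems)
    maj = sum(maj_at_k(s, r) for s, r in problems) / n
    pas = sum(pass_at_k(s, r) for s, r in problems) / n
    return {"maj": maj, "pass": pas}


# Hypothetical final answers sampled for two problems (k = 5 each).
toy_problems = [
    (["12", "15", "15", "12", "15"], "12"),  # correct answer is in the minority
    (["7", "7", "3", "7", "9"], "7"),        # correct answer is the majority
]
scores = evaluate(toy_problems)
```

On the first toy problem the correct answer appears among the samples but loses the vote, so pass@k counts it as solved while maj@k does not. This is why the two metrics can move in different directions: pass@k rewards diversity in the sampled solutions, while maj@k rewards consistency.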