Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

17 Mar 2024 | Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, Qi Zhang, Xuanjing Huang
This paper introduces R³: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that leverages outcome supervision to enhance the reasoning capabilities of large language models (LLMs). The core challenge in applying RL to complex reasoning tasks is identifying a sequence of actions that yield positive rewards and providing appropriate supervision for optimization. R³ overcomes these limitations by learning from correct demonstrations, progressively sliding the start state of reasoning from a demonstration's end to its beginning. This approach establishes a step-wise curriculum, allowing outcome supervision to provide step-level signals and precisely pinpoint errors. Using Llama2-7B, R³ outperforms RL baselines on eight reasoning tasks by an average of 4.1 points, and in program-based reasoning on GSM8K, it surpasses the baseline by 4.2 points across three backbone models. Notably, Codellama-7B + R³ performs comparably to larger or closed-source models without additional data. The method is interpretable as a form of dynamic programming, reducing the search space for reasoning and facilitating more efficient exploration. Extensive experiments and ablation studies demonstrate the effectiveness and stability of R³, highlighting its potential for enhancing reasoning abilities in various tasks.
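To make the reverse-curriculum idea concrete, the sketch below illustrates the core training loop under stated assumptions: a demonstration is split into reasoning steps, and at each curriculum stage the policy is started from a progressively earlier point of the correct demonstration, with only an outcome-level reward on the final answer. The helper names (`generate_continuation`, `outcome_reward`, `policy_update`) are hypothetical placeholders, not the paper's actual implementation or any specific library API.

```python
# Illustrative sketch of R3-style reverse curriculum RL with outcome supervision.
# All helper functions below are hypothetical stand-ins for a real LLM policy,
# reward check, and policy-gradient (e.g., PPO) update.

def generate_continuation(question, demo_prefix):
    """Placeholder: sample a reasoning continuation from the policy LLM."""
    # A real implementation would condition the model on the question plus the
    # given correct prefix and decode the remaining steps and final answer.
    return demo_prefix + " ... model-generated steps ... final answer"

def outcome_reward(completion, gold_answer):
    """Outcome supervision: reward 1.0 only if the final answer is correct."""
    return 1.0 if gold_answer in completion else 0.0

def policy_update(question, completion, reward):
    """Placeholder: one policy-gradient update on a sampled trajectory."""
    pass

def reverse_curriculum_training(dataset, epochs_per_stage=1):
    """Slide the start state from the demonstration's end back to its beginning.

    Each example is (question, demo_steps, gold_answer), where demo_steps is a
    correct demonstration split into reasoning steps.
    """
    max_steps = max(len(example[1]) for example in dataset)
    # Stage k: the policy is given all but the last k demonstration steps, so
    # early stages only explore the final step and the last stage starts from
    # the original question with no prefix.
    for k in range(1, max_steps + 1):
        for _ in range(epochs_per_stage):
            for question, demo_steps, gold_answer in dataset:
                start = max(len(demo_steps) - k, 0)
                prefix = " ".join(demo_steps[:start])   # correct partial reasoning
                completion = generate_continuation(question, prefix)
                reward = outcome_reward(completion, gold_answer)
                policy_update(question, completion, reward)

# Toy usage example with a two-step demonstration.
toy_data = [("What is 2 + 3 * 4?", ["3 * 4 = 12.", "2 + 12 = 14."], "14")]
reverse_curriculum_training(toy_data)
```

Because each stage only asks the policy to complete the reasoning from a nearly solved state, the sparse outcome reward effectively localizes errors to the few steps the model generated itself, which is how the paper frames outcome supervision as providing step-level signals.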