This paper introduces R³: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that leverages outcome supervision to enhance the reasoning capabilities of large language models (LLMs). The core challenge in applying RL to complex reasoning tasks is identifying a sequence of actions that yields positive rewards and providing appropriate supervision for optimization. R³ addresses these challenges by learning from correct demonstrations, progressively sliding the start state of reasoning from a demonstration's end to its beginning. This approach establishes a step-wise curriculum, allowing outcome supervision to provide step-level signals and precisely pinpoint errors. Using Llama2-7B, R³ outperforms RL baselines on eight reasoning tasks by an average of 4.1 points, and in program-based reasoning on GSM8K, it surpasses the baseline by 4.2 points across three backbone models. Notably, Codellama-7B + R³ performs comparably to larger or closed-source models without additional data. The method can be interpreted as a form of dynamic programming that reduces the search space for reasoning and facilitates more efficient exploration. Extensive experiments and ablation studies demonstrate the effectiveness and stability of R³, highlighting its potential for enhancing reasoning abilities across a variety of tasks.
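
The sketch below illustrates the reverse-curriculum idea described above: start states slide from the end of a correct demonstration toward its beginning, so early stages only require completing the final step while later stages require reasoning from the question alone. This is a minimal, hypothetical illustration under assumed step splitting and staging, not the paper's exact implementation; in R³ each start state would be paired with policy rollouts and an outcome-based reward during RL training.

```python
# Minimal sketch of a reverse curriculum over a correct demonstration.
# Assumption: the demonstration is already split into reasoning steps; the
# one-step-per-stage schedule here is illustrative, not the paper's exact setup.

from typing import List


def reverse_curriculum_start_states(question: str, demo_steps: List[str]) -> List[str]:
    """Build start states that slide from the demonstration's end to its beginning.

    The earliest stage reveals all but the last reasoning step, so the model only
    has to complete the final step and is judged by the outcome reward; each later
    stage reveals one step fewer, until the model must reason from the question alone.
    """
    states = []
    for revealed in range(len(demo_steps) - 1, -1, -1):
        prefix = "\n".join(demo_steps[:revealed])
        states.append(question + ("\n" + prefix if prefix else ""))
    return states


if __name__ == "__main__":
    # Toy demonstration for illustration only.
    demo = [
        "Step 1: The store sells 3 apples for $2, so one apple costs $2/3.",
        "Step 2: 12 apples therefore cost 12 * $2/3 = $8.",
        "Answer: 8",
    ]
    for i, state in enumerate(
        reverse_curriculum_start_states("How much do 12 apples cost?", demo)
    ):
        print(f"--- curriculum stage {i} ---\n{state}\n")
```

In an RL loop, each of these start states would serve as the point from which the policy samples the remaining reasoning, with the binary outcome reward on the final answer providing the step-level supervision the abstract describes.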