The paper introduces SELF-EXPLORE, a method that enhances the reasoning capabilities of large language models (LLMs) by enabling self-improvement through fine-grained rewards. The core idea is to identify the first incorrect step (the "first pit") in a rationale and use it as a reward signal to guide further training. Concretely, the method constructs a granular pairwise dataset through step-level exploration of the model's own rationales, which is then used for preference learning. Evaluated on the GSM8K and MATH datasets across three different LLMs, SELF-EXPLORE consistently outperforms supervised fine-tuning (SFT), with improvements of up to 11.57% on GSM8K and 2.89% on MATH; the gains are attributed to step-level supervision that pinpoints the first pit in the reasoning process rather than rewarding or penalizing entire rationales. The paper also discusses limitations of the approach, including the potential for overfitting and the need for further research on integrating diverse datasets to improve generalization. Overall, SELF-EXPLORE demonstrates the effectiveness of self-training with fine-grained reward signals for improving the reasoning capabilities of LLMs.
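To make the step-level exploration idea concrete, the sketch below shows one way it could be implemented. This is a minimal illustration under stated assumptions, not the authors' released code: the `sample_completions` and `is_correct` callables are hypothetical stand-ins for generating rollouts with the fine-tuned model and checking a completion against the gold answer, and the exact pair construction in the paper may differ in detail.

```python
from typing import Callable, List, Optional, Tuple


def find_first_pit(
    question: str,
    rejected_rationale: List[str],  # an incorrect rationale, split into steps
    sample_completions: Callable[[str, int], List[str]],  # hypothetical: k rollouts from a prefix
    is_correct: Callable[[str], bool],  # hypothetical: checks a completion against the gold answer
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step from which none of the sampled
    continuations reaches the correct answer (the "first pit")."""
    prefix = question
    for i, step in enumerate(rejected_rationale):
        prefix = prefix + "\n" + step
        rollouts = sample_completions(prefix, k)
        if not any(is_correct(r) for r in rollouts):
            return i  # no rollout recovers from this prefix: step i is the first pit
    return None  # every prefix can still be completed correctly; no pit found


def build_pairwise_example(
    question: str,
    chosen_rationale: str,  # a self-generated correct rationale for the same question
    rejected_rationale: List[str],
    first_pit: int,
) -> Tuple[str, str, str]:
    """Assemble one fine-grained (prompt, chosen, rejected) triple: the rejected
    side is truncated at the first pit so the preference signal targets the
    earliest wrong step rather than the whole rationale."""
    rejected_up_to_pit = "\n".join(rejected_rationale[: first_pit + 1])
    return question, chosen_rationale, rejected_up_to_pit
```

Sampling several continuations per step prefix and marking the earliest unrecoverable step concentrates the negative signal on the specific mistake, which is what distinguishes this step-level supervision from outcome-level preference data built over full rationales.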