16 May 2024 | Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, Minjoon Seo
The paper "SELF-EXPLORE to Avoid the PIT: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards" addresses the challenge of enhancing the reasoning capabilities of large language models (LLMs) through self-training. The authors propose a method called SELF-EXPLORE, which enables LLMs to identify and learn from their own generated rationales, specifically focusing on the first wrong step (or "first pit") in a Chain-of-Thought (CoT) rationale. By constructing a pairwise dataset and applying preference learning techniques, SELF-EXPLORE improves the model's ability to generate correct solutions step-by-step. The method is evaluated on the GSM8K and MATH datasets, showing significant improvements over supervised fine-tuning (SFT) across three different LLMs. The paper also discusses the effectiveness of step-level supervision and the impact of different preference learning objectives, demonstrating that fine-grained supervision consistently outperforms outcome-supervised methods. The authors conclude by highlighting the potential of SELF-EXPLORE in advancing LLM reasoning capabilities and suggest future directions for improving self-training methods.The paper "SELF-EXPLORE to Avoid the PIT: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards" addresses the challenge of enhancing the reasoning capabilities of large language models (LLMs) through self-training. The authors propose a method called SELF-EXPLORE, which enables LLMs to identify and learn from their own generated rationales, specifically focusing on the first wrong step (or "first pit") in a Chain-of-Thought (CoT) rationale. By constructing a pairwise dataset and applying preference learning techniques, SELF-EXPLORE improves the model's ability to generate correct solutions step-by-step. The method is evaluated on the GSM8K and MATH datasets, showing significant improvements over supervised fine-tuning (SFT) across three different LLMs. The paper also discusses the effectiveness of step-level supervision and the impact of different preference learning objectives, demonstrating that fine-grained supervision consistently outperforms outcome-supervised methods. The authors conclude by highlighting the potential of SELF-EXPLORE in advancing LLM reasoning capabilities and suggest future directions for improving self-training methods.