8 Aug 2024 | Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck
**Lean-STaR: Learning to Interleave Thinking and Proving**
**Authors:** Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck
**Institutions:** Language Technologies Institute, Carnegie Mellon University; Institute for Interdisciplinary Information Sciences, Tsinghua University
**Abstract:**
Traditional language-model-based theorem proving assumes that training on formal proof data alone will enable models to prove theorems. However, informal information that is absent from formal proofs can also be useful. Lean-STaR is a framework that trains language models to produce informal thoughts before each step of a proof, enhancing their theorem-proving capabilities. To create training data, it uses retrospective ground-truth tactics to generate synthetic thoughts; at inference time, the model generates a thought before predicting each tactic. Building on the Self-Taught Reasoner (STaR) framework, Lean-STaR then applies expert iteration, further fine-tuning the model on correct proofs it samples, verified with the Lean proof assistant. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark, significantly outperforming base models (43.4% → 46.3%, Pass@64).
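To make the interleaved format concrete, the sketch below shows what a single thought-augmented proof step could look like as data. This is a minimal illustration only; the class name, field names, and the specific example are assumptions, not taken from the paper's released code.

```python
# Minimal sketch of Lean-STaR's step format (illustrative names, not the
# paper's actual schema). Given a tactic state, the model first emits an
# informal thought, then a formal tactic.
from dataclasses import dataclass

@dataclass
class ThoughtStep:
    state: str    # pretty-printed Lean tactic state
    thought: str  # informal natural-language reasoning
    tactic: str   # formal Lean tactic to apply

example = ThoughtStep(
    state="n : ℕ ⊢ n + 0 = n",
    thought="Adding zero on the right is trivial; `simp` should close it.",
    tactic="simp",
)

# At inference the model conditions only on `state` and generates
# `thought` followed by `tactic`; only `tactic` is sent to Lean.
prompt = example.state
completion = f"{example.thought}\n{example.tactic}"
```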
**Introduction:**
Theorem proving is fundamental to both mathematics and AI, and formalized mathematics provides a challenging testbed for assessing reasoning capabilities. Lean-STaR couples informal thoughts with formal proving, interleaving natural-language reasoning with Lean tactics. It generates thought-augmented examples from Lean's Mathlib and fine-tunes a model on them using expert iteration.
**Method:**
Lean-STaR involves two phases: generating thought-augmented training examples and fine-tuning the model on them. In the first phase, GPT-4 retrospectively writes an informal thought for each step of a human-written Mathlib proof, conditioned on the proof state and the ground-truth tactic. In the second phase, expert iteration repeatedly samples proofs from the fine-tuned model, keeps those that Lean verifies, and fine-tunes on them to further improve performance.
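The following Python sketch outlines both phases under stated assumptions: `annotate`, `sample_proof`, `lean_verify`, and `finetune` are hypothetical stand-ins for the paper's actual tooling, used only to make the data flow explicit.

```python
# Hedged sketch of Lean-STaR's two training phases. Every helper here
# (annotate, sample_proof, lean_verify, finetune) is a hypothetical
# stand-in, not an API from the paper's code release.

def build_thought_dataset(proof_steps, annotate):
    """Phase 1: retrospective thought generation.

    For each (state, tactic) pair taken from a human-written Mathlib
    proof, an annotator model (GPT-4 in the paper) writes the informal
    thought that would lead to that tactic. Because the annotator sees
    the ground-truth tactic, the thought is written "in hindsight".
    """
    dataset = []
    for state, tactic in proof_steps:
        thought = annotate(state, tactic)
        dataset.append((state, thought, tactic))
    return dataset

def expert_iteration(model, theorems, lean_verify, rounds=2):
    """Phase 2: fine-tune on the model's own Lean-verified proofs."""
    for _ in range(rounds):
        new_steps = []
        for theorem in theorems:
            trace = model.sample_proof(theorem)       # interleaved thoughts + tactics
            if lean_verify(theorem, trace.tactics):   # keep only proofs Lean accepts
                new_steps.extend(trace.steps)         # (state, thought, tactic) triples
        model = model.finetune(new_steps)
    return model
```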
**Evaluation:**
Lean-STaR is evaluated on the miniF2F-test benchmark, where it proves 34.8% of theorems with 32 sampled attempts (pass@32) and 36.1% with 64 (pass@64), significantly outperforming previous state-of-the-art models. The method improves performance across problem types and difficulty levels, especially in number theory.
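For reference, pass@k here means sampling k independent proof attempts per theorem and counting the theorem as solved if any attempt is accepted by Lean. A minimal sketch, assuming hypothetical `prove_once` and `lean_verify` helpers:

```python
def pass_at_k(model, theorems, lean_verify, k=32):
    """Fraction of theorems solved by at least one of k sampled attempts.

    `model.prove_once` and `lean_verify` are hypothetical helpers; a real
    evaluation would also enforce time and compute limits per attempt.
    """
    solved = 0
    for theorem in theorems:
        attempts = (model.prove_once(theorem) for _ in range(k))
        if any(lean_verify(theorem, proof) for proof in attempts):
            solved += 1
    return solved / len(theorems)
```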
**Conclusion & Limitations:**
Lean-STaR enhances theorem-proving capabilities by integrating Chain-of-Thought (CoT) rationales into formal proving. It introduces the first thought-augmented theorem-proving dataset and demonstrates that expert iteration can further improve performance. Its main limitations are the computational cost of expert iteration and the cost of generating thought-annotated data.