8 Aug 2024 | Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck
**Lean-STaR: Learning to Interleave Thinking and Proving**
**Authors:** Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck
**Institutions:** Language Technologies Institute, Carnegie Mellon University; Institute for Interdisciplinary Information Sciences, Tsinghua University
**Abstract:**
Traditional language-model-based theorem proving assumes that training on formal proof data alone will enable models to prove theorems. However, informal information that is absent from formal proofs can also be useful. Lean-STaR is a framework that trains language models to produce informal thoughts before each step of a proof, enhancing their theorem-proving capabilities. To create training data, it uses retrospective ground-truth tactics to generate synthetic thoughts; at inference time, the model generates a thought before predicting each tactic. Building on the Self-Taught Reasoner (STaR) framework, Lean-STaR then applies expert iteration, further fine-tuning the model on correct proofs it samples, verified with the Lean proof assistant. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark, significantly outperforming base models (43.4% → 46.3%, Pass@64).
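To make the interleaved format concrete, the sketch below shows what a single thought-augmented proof step could look like as data. This is a minimal illustration only; the class name, field names, and the specific example are assumptions, not taken from the paper's released code.

```python
# Minimal sketch of Lean-STaR's step format (illustrative names, not the
# paper's actual schema). Given a tactic state, the model first emits an
# informal thought, then a formal tactic.
from dataclasses import dataclass

@dataclass
class ThoughtStep:
    state: str    # pretty-printed Lean tactic state
    thought: str  # informal natural-language reasoning
    tactic: str   # formal Lean tactic to apply

example = ThoughtStep(
    state="n : ℕ ⊢ n + 0 = n",
    thought="Adding zero on the right is trivial; `simp` should close it.",
    tactic="simp",
)

# At inference the model conditions only on `state` and generates
# `thought` followed by `tactic`; only `tactic` is sent to Lean.
prompt = example.state
completion = f"{example.thought}\n{example.tactic}"
```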
**Introduction:**
Theorem proving is fundamental to both mathematics and AI, and formalized mathematics provides a challenging testbed for assessing reasoning capabilities. Lean-STaR couples informal thoughts with formal proving, interleaving natural-language reasoning with Lean tactics. It generates thought-augmented examples from Lean's Mathlib and fine-tunes a model on them using expert iteration.
**Method:**
Lean-STaR involves two phases: generating thought-augmented training examples and fine-tuning the model on them. In the first phase, GPT-4 retrospectively writes an informal thought for each step of a human-written Mathlib proof, conditioned on the proof state and the ground-truth tactic. In the second phase, expert iteration repeatedly samples proofs from the fine-tuned model, keeps those that Lean verifies, and fine-tunes on them to further improve performance.
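The following Python sketch outlines both phases under stated assumptions: `annotate`, `sample_proof`, `lean_verify`, and `finetune` are hypothetical stand-ins for the paper's actual tooling, used only to make the data flow explicit.

```python
# Hedged sketch of Lean-STaR's two training phases. Every helper here
# (annotate, sample_proof, lean_verify, finetune) is a hypothetical
# stand-in, not an API from the paper's code release.

def build_thought_dataset(proof_steps, annotate):
    """Phase 1: retrospective thought generation.

    For each (state, tactic) pair taken from a human-written Mathlib
    proof, an annotator model (GPT-4 in the paper) writes the informal
    thought that would lead to that tactic. Because the annotator sees
    the ground-truth tactic, the thought is written "in hindsight".
    """
    dataset = []
    for state, tactic in proof_steps:
        thought = annotate(state, tactic)
        dataset.append((state, thought, tactic))
    return dataset

def expert_iteration(model, theorems, lean_verify, rounds=2):
    """Phase 2: fine-tune on the model's own Lean-verified proofs."""
    for _ in range(rounds):
        new_steps = []
        for theorem in theorems:
            trace = model.sample_proof(theorem)       # interleaved thoughts + tactics
            if lean_verify(theorem, trace.tactics):   # keep only proofs Lean accepts
                new_steps.extend(trace.steps)         # (state, thought, tactic) triples
        model = model.finetune(new_steps)
    return model
```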
**Evaluation:**
Lean-STaR is evaluated on the miniF2F-test benchmark, where it proves 34.8% of theorems with 32 sampled attempts (pass@32) and 36.1% with 64 (pass@64), significantly outperforming previous state-of-the-art models. The method improves performance across problem types and difficulty levels, especially in number theory.
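For reference, pass@k here means sampling k independent proof attempts per theorem and counting the theorem as solved if any attempt is accepted by Lean. A minimal sketch, assuming hypothetical `prove_once` and `lean_verify` helpers:

```python
def pass_at_k(model, theorems, lean_verify, k=32):
    """Fraction of theorems solved by at least one of k sampled attempts.

    `model.prove_once` and `lean_verify` are hypothetical helpers; a real
    evaluation would also enforce time and compute limits per attempt.
    """
    solved = 0
    for theorem in theorems:
        attempts = (model.prove_once(theorem) for _ in range(k))
        if any(lean_verify(theorem, proof) for proof in attempts):
            solved += 1
    return solved / len(theorems)
```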
**Conclusion & Limitations:**
Lean-STaR enhances theorem-proving capabilities by integrating Chain-of-Thought (CoT) rationales into formal proving. It introduces the first thought-augmented theorem-proving dataset and demonstrates that expert iteration can further improve performance. Its main limitations are the computational cost of expert iteration and the cost of generating thought-annotated data.