This paper introduces a new benchmark, Formalization for Lean 4 (FORML4), to evaluate the autoformalization capabilities of large language models (LLMs) in the Lean 4 formal language. Autoformalization is the task of converting natural language mathematics into a formal language, and it is crucial for advancing machine mathematical reasoning. Existing efforts have focused on formal languages with substantial online corpora; Lean 4, by contrast, is rapidly evolving, making it difficult for models to keep pace with its syntax, semantics, and library updates. FORML4 provides a comprehensive dataset of questions, answers, formal statements, and proofs, enabling a more accurate assessment of LLMs' ability to understand Lean 4 syntax and reason logically within its framework.
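For illustration only (this example is not drawn from the FORML4 dataset), autoformalization maps an informal statement to a checkable Lean 4 theorem:

```lean
-- Natural language: "For any natural number n, n + 0 = n."
theorem add_zero_example (n : Nat) : n + 0 = n := rfl
```

The Lean 4 compiler either accepts such a formalization or reports where it fails, which is the feedback signal the paper builds on.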
To enhance autoformalization, the paper proposes a Process-Supervised Verifier (PSV) model that leverages detailed feedback from the Lean 4 compiler. Rather than judging only whether a formalization compiles, the PSV uses step-level information from the compiler to score intermediate steps. Experiments show that the PSV significantly improves autoformalization, especially when trained on data with detailed process information, and that it outperforms an Outcome-Supervised Verifier (OSV) on metrics such as precision and recall.
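One way to picture the process supervision described above: each proof step before the first compiler error is treated as valid, and steps from the error onward as invalid. This is a hedged sketch of that labeling idea; the function name and error format are illustrative stand-ins, not the paper's actual tooling.

```python
def step_labels(proof_steps, first_error_step):
    """Assign a step-level label to each proof step.

    Steps before the first compiler error get label 1 (valid);
    the erroring step and everything after it get label 0.
    first_error_step is None when the whole proof compiles.
    """
    if first_error_step is None:
        return [1] * len(proof_steps)
    return [1 if i < first_error_step else 0
            for i in range(len(proof_steps))]

# A three-step candidate proof whose third step fails to compile:
labels = step_labels(["intro n", "simp", "exact h"], first_error_step=2)
# An outcome-supervised signal would collapse this to a single 0/1 label;
# the step-level labels preserve where the proof went wrong.
```

The contrast with outcome supervision is exactly this granularity: the OSV sees one label per proof, while the PSV sees one label per step.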
The paper also examines the limitations of existing LLMs on FORML4's autoformalization tasks, highlighting the need for improved methods. The results show that the PSV model outperforms the OSV model across the various test sets. Additionally, increasing the size of the training data improves performance on the test sets, although the real test set remains challenging because of its out-of-distribution domains and its independence from pre-defined lemmas.
The paper concludes that the proposed benchmark and method offer significant potential for improving the autoformalization capabilities of LLMs, and future work will extend these methods to other formal languages such as Isabelle, HOL Light, and Coq. The iterative training process involving the autoformalizer, verifier, and Lean 4 compiler is shown to be a promising direction for further advancements in autoformalization.
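The iterative process mentioned above can be sketched as a simple loop: the autoformalizer proposes candidates, the Lean 4 compiler checks them step by step, and the verifier is retrained on the resulting feedback. Everything below is an illustrative toy, with a mocked compiler and trivial components standing in for the paper's actual models.

```python
def mock_compiler(candidate):
    """Toy stand-in for the Lean 4 compiler: report pass/fail per step.

    Here a step 'compiles' if it is a non-empty string.
    """
    return [bool(step.strip()) for step in candidate]

class Verifier:
    """Toy process-supervised verifier: accumulates step-labeled examples."""
    def __init__(self):
        self.examples = []
    def train(self, candidates, feedback):
        self.examples.extend(zip(candidates, feedback))

def iterate(autoformalizer, verifier, problems, rounds=2):
    """One autoformalizer -> compiler -> verifier loop, repeated."""
    verified = []
    for _ in range(rounds):
        candidates = [autoformalizer(p) for p in problems]
        feedback = [mock_compiler(c) for c in candidates]
        verifier.train(candidates, feedback)
        # Fully verified candidates would seed the next fine-tuning pass.
        verified = [c for c, f in zip(candidates, feedback) if all(f)]
    return verifier, verified

verifier = Verifier()
_, verified = iterate(lambda p: [p, "simp"], verifier, ["intro n"])
```

The key design point the paper argues for is visible even in this toy: the compiler's step-level feedback, not just a final pass/fail bit, is what the verifier learns from.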