2024 | Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu
This paper presents Don't Trust: Verify (DTV), a method for improving the accuracy of large language models (LLMs) on quantitative reasoning problems. The approach leverages the ability of LLMs to autoformalize informal mathematical statements into machine-checkable formal statements, which automated theorem provers can then verify. By filtering sampled solutions through this verification step, DTV identifies correct answers more reliably than majority voting alone.
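As a rough sketch of how such a pipeline fits together, consider the code below. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`dtv_select_answer`, `generate`, `autoformalize`, `verify`, `extract_answer`), the sample count, and the fallback to plain majority voting when nothing verifies are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

def dtv_select_answer(
    problem: str,
    generate: Callable[[str], str],            # LLM: problem -> informal solution text
    autoformalize: Callable[[str, str], str],  # LLM: (problem, solution) -> formal statement
    verify: Callable[[str], bool],             # automated prover: does the statement check?
    extract_answer: Callable[[str], str],      # parse the final answer from a solution
    n_samples: int = 16,
) -> Optional[str]:
    """Sample informal solutions, keep those whose autoformalization the
    prover verifies, and majority-vote among the survivors."""
    all_answers, verified_answers = [], []
    for _ in range(n_samples):
        solution = generate(problem)
        answer = extract_answer(solution)
        all_answers.append(answer)
        if verify(autoformalize(problem, solution)):
            verified_answers.append(answer)
    # Assumption: fall back to plain majority voting when no sample verifies.
    pool = verified_answers or all_answers
    return Counter(pool).most_common(1)[0][0] if pool else None
```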
The key idea is to use a formal theorem proving environment to check the correctness of informal solutions. An LLM translates each informal problem statement and candidate solution into formal logic, and an automated theorem prover then attempts to prove the formalized statement. If the proof succeeds, the corresponding informal solution is treated as correct; solutions whose formalizations cannot be verified are filtered out, improving the model's overall accuracy.
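To make the verification step concrete, here is a toy illustration of what a formalized answer claim might look like. The problem, the identifiers, and the use of Lean are invented for this sketch; the point is only that the givens of a word problem become hypotheses and the solution's claimed answer becomes a theorem an automated prover can try to close.

```lean
-- Hypothetical toy problem (not from the paper): "A box holds 12 pencils.
-- Natalia has 4 boxes. How many pencils does she have in total?"
-- The informal solution claims 48; autoformalization encodes that claim
-- as a theorem, and an automated prover attempts the proof.
theorem natalia_pencils (boxes pencils_per_box : Nat)
    (h1 : boxes = 4) (h2 : pencils_per_box = 12) :
    boxes * pencils_per_box = 48 := by
  subst h1
  subst h2
  rfl
```

If the prover closes the goal, the informal solution counts as verified; had the solution claimed 47 instead, no proof would exist and that sample would be filtered out.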
DTV is evaluated on several datasets, including GSM8K, MATH, and MultiArith. The results show that DTV outperforms majority voting by more than 12% on GSM8K. The method is effective across different model sizes and is complementary to other approaches that rely on informal reasoning. The paper also discusses the limitations of the approach, including the need for a suitable theorem proving environment and the challenges of translating informal statements into formal logic.
The study demonstrates that autoformalization can significantly enhance the reliability of LLMs in solving quantitative reasoning problems. By combining the strengths of LLMs with formal verification, DTV provides a robust method for identifying correct answers in mathematical reasoning tasks.