DON'T TRUST: VERIFY – GROUNDING LLM QUANTITATIVE REASONING WITH AUTOFORMALIZATION

26 Mar 2024 | Jin Peng Zhou*, Charles Staats†, Wenda Li, Christian Szegedy†, Kilian Q. Weinberger, Yuhuai Wu†
This paper addresses the problem of large language models (LLMs) making unjustified logical and computational errors when solving mathematical quantitative reasoning problems. The authors propose a method called "Don't Trust: Verify" (DTV) that leverages the autoformalization capability of LLMs to translate informal solutions into formal Isabelle code, which can be automatically checked for internal consistency. This allows DTV to reject candidate solutions that are inconsistent with themselves or with the formalized problem statement before aggregating the remaining answers. The method is evaluated on GSM8K, MATH, and MultiArith, showing a significant improvement over vanilla majority voting, including a more than 12% absolute accuracy gain on GSM8K. The paper also discusses limitations of the approach, such as the need for a theorem-proving environment that supports the problem domain, and notes that more effective filters could further improve performance. The code for DTV is available at <https://github.com/jinpz/dtv>.
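To make the filtering-then-voting idea concrete, the following is a minimal sketch of verification-filtered answer selection. It is not the repository's actual API: the callables `autoformalize`, `verify`, and `extract_answer` are hypothetical placeholders for the LLM-based formalizer, the Isabelle checker, and the final-answer parser, respectively.

```python
from collections import Counter
from typing import Callable, List, Optional


def dtv_select_answer(
    problem: str,
    candidate_solutions: List[str],
    extract_answer: Callable[[str], str],
    autoformalize: Callable[[str, str], str],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Sketch of DTV-style verification-filtered majority voting.

    Each informal candidate solution is autoformalized into Isabelle code
    and checked; only candidates whose formalization verifies take part in
    the final majority vote. If no candidate verifies, fall back to plain
    majority voting over all candidates.
    """
    verified_answers = []
    for solution in candidate_solutions:
        formal_code = autoformalize(problem, solution)  # LLM-produced Isabelle code
        if verify(formal_code):                         # checker accepts the formalization
            verified_answers.append(extract_answer(solution))

    # Vote over verified answers when any exist, otherwise over all answers.
    pool = verified_answers or [extract_answer(s) for s in candidate_solutions]
    if not pool:
        return None
    answer, _count = Counter(pool).most_common(1)[0]
    return answer
```

The key design choice reflected here is that verification acts as a filter on top of sampling, not a replacement for it: the final answer is still chosen by majority vote, but only among solutions whose formalizations pass the automatic consistency check.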