A ∧ B ⇔ B ∧ A: Triggering Logical Reasoning Failures in Large Language Models


1 Jan 2024 | Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael Lyu
This paper introduces LogicAsker, an automated tool designed to comprehensively evaluate and improve the logical reasoning abilities of large language models (LLMs). The authors address the challenge of assessing LLMs' reasoning capabilities, which are often evaluated only on downstream tasks rather than on the underlying reasoning process. LogicAsker generates test cases based on atomic skills in propositional and predicate logic, identifying LLMs' weaknesses and providing insights into their reasoning abilities. The tool is evaluated on six widely deployed LLMs: GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco. The results show that LogicAsker effectively uncovers logical reasoning failures in these models, with failure rates ranging from 25% to 94%. Additionally, the test cases generated by LogicAsker can be used to design in-context learning demonstrations that improve the LLMs' logical reasoning, for example raising GPT-4's accuracy from 75% to 85%. The paper also discusses the limitations and future directions of the work, emphasizing the importance of responsible AI deployment and the reliability of LLMs. All code, data, and results are available for reproducibility and further research.
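To make the test-generation idea concrete, here is a minimal sketch of how a LogicAsker-style test case might be built from an atomic inference rule (such as conjunction commutativity, A ∧ B ⇔ B ∧ A) and phrased as a yes/no prompt for an LLM. The rule names, templates, and answer format below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): instantiate an atomic
# propositional-logic rule with concrete propositions and wrap it as a
# natural-language yes/no question whose expected answer is "yes".
import random

# A few atomic skills as (name, premise template, conclusion template).
# These rule names and templates are illustrative assumptions.
ATOMIC_RULES = [
    ("conjunction commutativity", "{a} and {b}", "{b} and {a}"),
    ("modus ponens", "if {a} then {b}; {a}", "{b}"),
    ("disjunctive syllogism", "{a} or {b}; not {a}", "{b}"),
]

# Simple atomic propositions used to fill the templates.
PROPOSITIONS = ["it is raining", "the street is wet", "the match is cancelled"]

def make_test_case(rng: random.Random) -> dict:
    """Build one test case: pick a rule, instantiate it, and phrase it
    as a prompt that a logically sound model should answer 'yes'."""
    name, premise_t, conclusion_t = rng.choice(ATOMIC_RULES)
    a, b = rng.sample(PROPOSITIONS, 2)
    premise = premise_t.format(a=a, b=b)
    conclusion = conclusion_t.format(a=a, b=b)
    prompt = (f"Premise: {premise}. "
              f"Question: does it logically follow that {conclusion}? "
              f"Answer yes or no.")
    return {"skill": name, "prompt": prompt, "expected": "yes"}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        case = make_test_case(rng)
        print(case["skill"], "->", case["prompt"])
```

Failing cases collected this way could then be reused as in-context demonstrations, which is how the paper reports improving model accuracy.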
[slides and audio] LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models