A ∧ B ⇔ B ∧ A: Triggering Logical Reasoning Failures in Large Language Models

1 Jan 2024 | Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael Lyu
This paper introduces LogicAsker, an automatic framework for evaluating and improving the formal reasoning ability of large language models (LLMs). The framework is grounded in propositional and predicate logic, two fundamental systems for formalizing reasoning. LogicAsker systematically generates reasoning questions by converting standard logic expressions into natural language, evaluates LLMs' performance on these questions, identifies weaknesses in their reasoning, and generates demonstration examples that improve their reasoning capacity through in-context learning.

The framework is evaluated on six widely deployed LLMs: GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco. The results show that test cases generated by LogicAsker expose logical reasoning failures in these LLMs at rates of 25% to 94%. Furthermore, the test cases can be used to design demonstration examples for in-context learning, which effectively improves the logical reasoning ability of LLMs, e.g., by 10% for GPT-4.

The paper also discusses the challenges of evaluating LLMs' reasoning abilities, including the difficulty of defining a comprehensive set of reasoning skills, the lack of a system that organizes test cases to cover all formal reasoning scenarios, and the limited scope of existing benchmarks. The authors argue that formal reasoning is more structured and reliable than informal reasoning and is widely used in important software engineering tasks such as type inference and program repair. The paper concludes that LogicAsker is the first framework to create prompts based on testing results that effectively improve LLMs' formal reasoning ability. All code, data, and results will be released for reproduction and future research.
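To make the question-generation step concrete, here is a minimal Python sketch, not the authors' implementation: the rule set, natural-language templates, and atom list below are hypothetical. It illustrates how symbolic inference patterns (including the commutativity of conjunction in the title, A ∧ B ⇔ B ∧ A) could be rendered as yes/no reasoning questions with known ground-truth answers, which can then be posed to an LLM to locate the skills it fails.

```python
# Hypothetical sketch of LogicAsker-style test-case generation:
# each entry pairs a symbolic inference pattern with a natural-language
# template; filling the template with concrete atoms yields a yes/no
# question whose correct answer is known in advance.
import random

ATOMS = ["it is raining", "the ground is wet", "the light is on", "the door is open"]

# (skill name, question template, ground-truth answer) -- illustrative rule set
RULES = [
    ("modus ponens",
     "If {p}, then {q}. {P}. Can we conclude that {q}?",
     "yes"),
    ("conjunction commutativity",   # A ∧ B ⇔ B ∧ A
     "We know that {p} and {q}. Can we conclude that {q} and {p}?",
     "yes"),
    ("denying the antecedent",      # a fallacy: the conclusion does NOT follow
     "If {p}, then {q}. It is not the case that {p}. "
     "Can we conclude that it is not the case that {q}?",
     "no"),
]

def generate_case(rng: random.Random) -> dict:
    """Sample a rule and two distinct atoms, then render a test question."""
    skill, template, answer = rng.choice(RULES)
    p, q = rng.sample(ATOMS, 2)
    question = template.format(p=p, q=q, P=p.capitalize())
    return {"skill": skill, "question": question, "expected": answer}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        case = generate_case(rng)
        print(f"[{case['skill']}] {case['question']}  (expected: {case['expected']})")
```

Under this reading of the paper, skills on which a model answers incorrectly can be paired with correct worked-out answers and prepended as in-context demonstrations, which is the improvement mechanism whose gains (e.g., 10% for GPT-4) are reported above.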