LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

6 Jun 2024 | Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral
The paper introduces *LogicBench*, a comprehensive dataset designed to evaluate the logical reasoning abilities of large language models (LLMs). The dataset covers 25 reasoning patterns spanning propositional, first-order, and non-monotonic logics, with each example targeting a single inference rule at a time. The authors evaluate a range of LLMs, including GPT-4, ChatGPT, Gemini, Llama-2, and Mistral, using chain-of-thought prompting. The results show that existing LLMs struggle with complex reasoning and negations, often overlooking contextual information necessary for correct conclusions. The paper also discusses the limitations of current LLMs in logical reasoning and suggests future directions for improvement, such as increasing the depth of reasoning complexity and improving performance on non-monotonic logics. Additionally, the authors demonstrate that fine-tuning LLMs on *LogicBench* improves their logical reasoning abilities, leading to better performance on other logic datasets.
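
To make the single-inference-rule, binary question-answer setup concrete, below is a minimal sketch of how one such item (here, modus ponens) and a chain-of-thought evaluation loop might look. The item fields, `build_cot_prompt`, and the `query_llm` helper are illustrative assumptions for this summary, not the authors' released schema or code.

```python
from typing import Callable

# One hypothetical binary QA item targeting a single inference rule (modus ponens).
item = {
    "axiom": "modus_ponens",
    "context": "If it rains, the match is cancelled. It is raining.",
    "question": "Can we conclude that the match is cancelled?",
    "answer": "yes",
}

def build_cot_prompt(item: dict) -> str:
    """Build a chain-of-thought style prompt for a yes/no question."""
    return (
        f"Context: {item['context']}\n"
        f"Question: {item['question']}\n"
        "Think step by step, then answer with 'yes' or 'no'."
    )

def evaluate(items: list[dict], query_llm: Callable[[str], str]) -> float:
    """Return accuracy of the model's yes/no predictions over the items."""
    correct = 0
    for it in items:
        reply = query_llm(build_cot_prompt(it)).lower()
        # Treat whichever of 'yes'/'no' appears last in the reply as the final answer.
        pred = "yes" if reply.rfind("yes") > reply.rfind("no") else "no"
        correct += pred == it["answer"]
    return correct / len(items)
```

In this sketch, `query_llm` stands in for whatever model interface is used (e.g., an API client or a local model wrapper), so the same loop can compare different LLMs on identical items.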