LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models


6 Jun 2024 | Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral
LogicBench is a new dataset designed to systematically evaluate the logical reasoning ability of large language models (LLMs). It covers 25 reasoning patterns spanning propositional, first-order, and non-monotonic logics, and each instance focuses on a single inference rule, enabling a precise assessment of LLMs' logical reasoning capabilities. LogicBench comprises two tasks: binary question-answering (BQA) and multiple-choice question-answering (MCQA). BQA requires answering "yes" or "no" depending on whether a conclusion is logically entailed by the context, while MCQA requires selecting the correct conclusion from four options.

The dataset was created through a three-stage process: sentence generation, conversion to natural language, and task instance generation.

Several LLMs, including GPT-4, ChatGPT, Gemini, Llama-2, and Mistral, were evaluated on LogicBench. The results show that these models struggle with complex reasoning and negations, often overlooking contextual information necessary to reach the correct conclusion. Larger models generally performed better on the logical reasoning tasks.

In addition, an augmented version of the dataset was used to fine-tune T5-large, which then showed improved performance on existing logic datasets. The findings indicate that logical reasoning remains a significant challenge for LLMs, with room for improvement in handling complex logical dependencies and negations, and they highlight the importance of systematic evaluation for understanding and enhancing the logical reasoning capabilities of LLMs.
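To make the two task formats concrete, the sketch below shows how BQA and MCQA instances could be represented and scored. The field names and the example instance (built around modus tollens) are hypothetical illustrations, not the dataset's actual schema.

```python
# Minimal sketch of LogicBench-style task instances and scoring.
# Field names and the example below are hypothetical, not the real schema.

from dataclasses import dataclass
from typing import List


@dataclass
class BQAInstance:
    """Binary QA: answer 'yes' or 'no' given a context and a candidate conclusion."""
    context: str
    question: str
    answer: str  # "yes" or "no"


@dataclass
class MCQAInstance:
    """Multiple-choice QA: pick the logically entailed conclusion from four options."""
    context: str
    choices: List[str]
    answer_index: int  # index of the correct choice


# Example built around modus tollens (a propositional inference rule):
# from "if P then Q" and "not Q", conclude "not P".
bqa_example = BQAInstance(
    context="If it rained last night, the grass is wet. The grass is not wet.",
    question="Does this imply that it did not rain last night?",
    answer="yes",
)

mcqa_example = MCQAInstance(
    context="If it rained last night, the grass is wet. The grass is not wet.",
    choices=[
        "It rained last night.",
        "It did not rain last night.",
        "The grass is wet.",
        "Nothing can be concluded.",
    ],
    answer_index=1,
)


def bqa_accuracy(predictions: List[str], instances: List[BQAInstance]) -> float:
    """Fraction of yes/no predictions that match the gold answers."""
    correct = sum(
        p.strip().lower() == inst.answer for p, inst in zip(predictions, instances)
    )
    return correct / len(instances)


if __name__ == "__main__":
    print(bqa_accuracy(["Yes"], [bqa_example]))  # 1.0
```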