Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification

27 May 2024 | Soumya Sanyal, Tianyi Xiao, Jiacheng Liu, Wenyu Wang, Xiang Ren
This paper investigates the differences in complex reasoning abilities between humans and large language models (LLMs) in the context of entailment verification (EV): deciding whether a given context supports a hypothesis, a task that often requires combining multiple pieces of evidence through multi-hop reasoning. The authors compile an EV benchmark from datasets in three NLP domains: natural language inference (NLI), contextual question answering (QA), and rationales.

Comparing several LLMs, including GPT-4, against human annotators, they find complementary strengths: GPT-4 outperforms humans on samples that require complex, multi-hop reasoning over long contexts, whereas humans remain better on samples that need only simple deductive reasoning.

The authors also fine-tune a Flan-T5 model for EV using two training objectives, classification and ranking. The ranking-based objective outperforms the classification-based one, especially on the contextual QA datasets, and the fine-tuned Flan-T5 verifier performs comparably to GPT-4. They further demonstrate a downstream use of the verifier: filtering out inconsistent model-generated rationales during self-consistency decoding, which improves accuracy by 6% on average across three multiple-choice question (MCQ) datasets.

Overall, the study highlights the respective strengths and limitations of humans and LLMs in complex reasoning, suggests that ranking-based training objectives can improve EV performance on certain task types, and contributes a new EV benchmark together with insights into the inference gaps between humans and machines.
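To make the EV setup concrete, below is a minimal sketch of how an instruction-tuned seq2seq model such as Flan-T5 can be used as an entailment verifier, by comparing the probabilities it assigns to "Yes" vs. "No" as the first answer token. The prompt template, the label words, the checkpoint name, and the margin value in the pairwise ranking loss are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: scoring entailment with a Flan-T5 model (assumed prompt/labels, not the authors' exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.eval()


def entailment_score(premise: str, hypothesis: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for 'does the premise support the hypothesis?'."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the premise entail the hypothesis? Answer Yes or No."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # Score the single-token answers "Yes" and "No" at the first decoder position.
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()


def pairwise_ranking_loss(pos_score: torch.Tensor, neg_score: torch.Tensor,
                          margin: float = 0.1) -> torch.Tensor:
    """Hinge-style illustration of a ranking objective: push the score of a supported
    hypothesis above that of an unsupported one by at least `margin` (value assumed)."""
    return torch.clamp(margin - (pos_score - neg_score), min=0.0)
```

The same scoring function can illustrate the rationale-filtering application: sample several chain-of-thought rationales, keep only the answers whose rationale the verifier judges as supporting them, and majority-vote over the survivors. How the premise and hypothesis are constructed from the question, rationale, and answer, and the 0.5 threshold, are assumptions for the sketch.

```python
# Sketch: filtering inconsistent rationales before the self-consistency vote (details assumed).
from collections import Counter


def filtered_self_consistency(question: str, samples: list[tuple[str, str]],
                              threshold: float = 0.5) -> str:
    """samples: (rationale, answer) pairs from repeated chain-of-thought sampling."""
    kept = [
        answer
        for rationale, answer in samples
        if entailment_score(premise=f"{question} {rationale}", hypothesis=answer) >= threshold
    ]
    if not kept:  # fall back to a plain majority vote if every rationale is filtered out
        kept = [answer for _, answer in samples]
    return Counter(kept).most_common(1)[0][0]
```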