Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification

27 May 2024 | Soumya Sanyal, Tianyi Xiao, Jiacheng Liu, Wenyu Wang, Xiang Ren
This paper investigates the differences in complex reasoning abilities between humans and large language models (LLMs) in the context of entailment verification (EV): deciding whether a given context supports a hypothesis, a task that often requires combining multiple pieces of evidence through multi-hop reasoning. The authors compile an EV benchmark from datasets in three NLP domains: natural language inference (NLI), contextual question answering (QA), and rationales.

Comparing several LLMs, including GPT-4, against human annotators, they find complementary strengths: GPT-4 outperforms humans on samples that require complex, multi-hop reasoning over long contexts, whereas humans remain better on samples that need only simple deductive reasoning.

The authors also fine-tune a Flan-T5 model for EV using two training objectives, classification and ranking. The ranking-based objective outperforms the classification-based one, especially on the contextual QA datasets, and the fine-tuned Flan-T5 verifier performs comparably to GPT-4. They further demonstrate a downstream use of the verifier: filtering out inconsistent model-generated rationales during self-consistency decoding, which improves accuracy by 6% on average across three multiple-choice question (MCQ) datasets.

Overall, the study highlights the respective strengths and limitations of humans and LLMs in complex reasoning, suggests that ranking-based training objectives can improve EV performance on certain task types, and contributes a new EV benchmark together with insights into the inference gaps between humans and machines.
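To make the EV setup concrete, below is a minimal sketch of how an instruction-tuned seq2seq model such as Flan-T5 can be used as an entailment verifier, by comparing the probabilities it assigns to "Yes" vs. "No" as the first answer token. The prompt template, the label words, the checkpoint name, and the margin value in the pairwise ranking loss are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: scoring entailment with a Flan-T5 model (assumed prompt/labels, not the authors' exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.eval()


def entailment_score(premise: str, hypothesis: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for 'does the premise support the hypothesis?'."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the premise entail the hypothesis? Answer Yes or No."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # Score the single-token answers "Yes" and "No" at the first decoder position.
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()


def pairwise_ranking_loss(pos_score: torch.Tensor, neg_score: torch.Tensor,
                          margin: float = 0.1) -> torch.Tensor:
    """Hinge-style illustration of a ranking objective: push the score of a supported
    hypothesis above that of an unsupported one by at least `margin` (value assumed)."""
    return torch.clamp(margin - (pos_score - neg_score), min=0.0)
```

The same scoring function can illustrate the rationale-filtering application: sample several chain-of-thought rationales, keep only the answers whose rationale the verifier judges as supporting them, and majority-vote over the survivors. How the premise and hypothesis are constructed from the question, rationale, and answer, and the 0.5 threshold, are assumptions for the sketch.

```python
# Sketch: filtering inconsistent rationales before the self-consistency vote (details assumed).
from collections import Counter


def filtered_self_consistency(question: str, samples: list[tuple[str, str]],
                              threshold: float = 0.5) -> str:
    """samples: (rationale, answer) pairs from repeated chain-of-thought sampling."""
    kept = [
        answer
        for rationale, answer in samples
        if entailment_score(premise=f"{question} {rationale}", hypothesis=answer) >= threshold
    ]
    if not kept:  # fall back to a plain majority vote if every rationale is filtered out
        kept = [answer for _, answer in samples]
    return Counter(kept).most_common(1)[0][0]
```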