13 Jun 2024 | Zhaochen Su, Juntao Li, Jun Zhang, Tong Zhu, Xiaoye Qu, Pan Zhou, Bowen Yan, Yu Cheng, Min Zhang
The paper introduces CoTempQA, a comprehensive benchmark for evaluating large language models (LLMs) on co-temporal reasoning, i.e., reasoning about events that occur concurrently. CoTempQA covers four scenarios (Equal, Overlap, During, and Mix) across 4,748 samples. The study reveals a significant gap between LLM performance and human-level reasoning, even with enhancements such as Chain-of-Thought (CoT) prompting. The authors find that mathematical reasoning plays a crucial role in handling co-temporal events and propose a Math-Reasoning CoT (Mr-CoT) strategy to strengthen LLMs' co-temporal reasoning. Mr-CoT achieves a notable improvement over existing baselines but still falls short of human performance, underscoring the need for further advances in co-temporal reasoning, which is essential for understanding and reasoning about concurrent events in real-world scenarios. The dataset and code are available for further research.
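To make the underlying task concrete, the sketch below classifies how two closed time intervals relate, reusing the benchmark's scenario names (Equal, During, Overlap). This is an illustrative reduction of co-temporal reasoning to interval arithmetic, the kind of mathematical structure the Mr-CoT finding points to; the function name `cotemporal_relation` and the exact boundary conditions are assumptions for this sketch, not the benchmark's actual construction rules.

```python
from datetime import date

def cotemporal_relation(a_start: date, a_end: date,
                        b_start: date, b_end: date) -> str:
    """Classify how two closed intervals [a_start, a_end] and
    [b_start, b_end] relate, loosely mirroring CoTempQA's scenario
    names. The definitions here are illustrative only."""
    if a_start == b_start and a_end == b_end:
        return "Equal"      # identical spans
    if b_start <= a_start and a_end <= b_end:
        return "During"     # A falls entirely within B
    if a_start <= b_start and b_end <= a_end:
        return "During"     # B falls entirely within A
    if a_start <= b_end and b_start <= a_end:
        return "Overlap"    # spans intersect partially
    return "Disjoint"       # no co-temporal relation

# Example: comparing two (hypothetical) tenures
print(cotemporal_relation(date(2010, 1, 1), date(2015, 6, 30),
                          date(2012, 3, 1), date(2014, 9, 1)))   # During
print(cotemporal_relation(date(2010, 1, 1), date(2013, 1, 1),
                          date(2012, 1, 1), date(2016, 1, 1)))   # Overlap
```

A co-temporal question ("While X held role A, who held role B?") implicitly requires exactly this kind of interval comparison over extracted dates, which is consistent with the paper's observation that explicit mathematical reasoning over timestamps helps LLMs where plain CoT prompting does not.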