13 Jun 2024 | Bahare Fatemi¹, Mehran Kazemi², Anton Tsitsulin¹, Karishma Malkan², Jinyeong Yim³, John Palowitch², Sungyong Seo³, Jonathan Halcrow¹, and Bryan Perozzi¹
This paper introduces Test of Time (ToT), a novel benchmark for evaluating large language models (LLMs) on temporal reasoning. Existing benchmarks often rely on real-world data or anonymized facts that may introduce factual inconsistencies. ToT addresses these limitations by introducing synthetic datasets designed to assess LLMs' temporal reasoning abilities in various scenarios. The benchmark includes two tasks: ToT-Semantic, which focuses on temporal semantics and logic, and ToT-Arithmetic, which assesses time arithmetic skills. The synthetic nature of ToT allows for systematic investigation into how problem structure, size, question type, and fact order affect LLM performance. The benchmark is open-sourced, enabling further research and development in temporal reasoning. Experiments show that LLMs struggle with complex temporal tasks, particularly those involving arithmetic and duration calculations. The results highlight the importance of fact order and the need for more comprehensive evaluation methods in temporal reasoning. ToT provides a more controlled and diverse assessment of LLM capabilities, offering insights into their strengths and weaknesses in temporal reasoning tasks.