13 Jun 2024 | Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi
This article introduces the Test of Time (ToT) benchmark, a novel dataset designed to evaluate large language models (LLMs) on their ability to reason about time. The authors highlight the limitations of existing benchmarks, which often rely on real-world data or anonymized facts that may introduce factual inconsistencies. To address these issues, the authors develop two tasks: ToT-Semantic and ToT-Arithmetic. ToT-Semantic focuses on temporal semantics and logic, using synthetic data to explore diverse graph structures and reasoning tasks. ToT-Arithmetic, on the other hand, assesses LLMs' ability to perform time-based calculations. The authors also provide an open-source dataset and evaluation framework, available at https://huggingface.co/datasets/baharef/ToT.
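The paper's actual generation procedure is not reproduced here, but a minimal sketch can convey what ToT-Semantic-style input might look like: anonymized temporal facts over a synthetic graph, so that no real-world knowledge helps the model. The entity names `E0`..`E3` and the relation `worked_with` below are illustrative assumptions, not taken from the dataset.

```python
import random

def make_synthetic_facts(num_entities=4, seed=0):
    """Generate anonymized temporal facts of the form
    (subject, relation, object, start_year, end_year).
    Entity names are synthetic placeholders, so answering questions
    about them requires reasoning over the stated facts alone."""
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(num_entities)]
    facts = []
    for subj in entities:
        # Pick a random partner and a random validity interval.
        obj = rng.choice([e for e in entities if e != subj])
        start = rng.randint(1900, 2000)
        end = start + rng.randint(1, 20)
        facts.append((subj, "worked_with", obj, start, end))
    return facts

facts = make_synthetic_facts()
for fact in facts:
    print(fact)
```

A question generator would then ask, for example, which entities' intervals overlap, or what held true at a given year, with the graph's structure (sparse vs. complete) controlled by the generation parameters.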
The study evaluates three leading LLMs: GPT-4, Gemini 1.5 Pro, and Claude-3-Sonnet. The results show that the structure of the temporal data significantly impacts LLM performance, with complete graphs yielding higher accuracy than sparse ones. The type of temporal question also affects performance, with some tasks being more challenging than others. Additionally, the order of facts in the prompt influences LLM performance, with certain sorting methods leading to better results. The study also finds that LLMs perform better on tasks involving time zones and simple arithmetic, but struggle with more complex tasks such as duration calculations and leap year computations.
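The arithmetic operations named above are the kind a model must carry out in its head. As a point of reference, here is how the same three categories (time zone conversion, duration, leap years) are computed programmatically with Python's standard library; the specific dates are illustrative, not drawn from ToT-Arithmetic.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo
import calendar

# Time zone conversion: 09:00 in New York expressed in Tokyo time.
# In June, New York is on EDT (UTC-4) and Tokyo on JST (UTC+9),
# so the offset is 13 hours.
ny = datetime(2024, 6, 13, 9, 0, tzinfo=ZoneInfo("America/New_York"))
tokyo = ny.astimezone(ZoneInfo("Asia/Tokyo"))

# Duration: number of days between two dates (crosses a leap day).
span = date(2024, 6, 13) - date(2023, 12, 25)

# Leap years: divisible by 4, except centuries not divisible by 400.
print(tokyo.strftime("%H:%M"), span.days,
      calendar.isleap(2024), calendar.isleap(1900))
# → 22:00 171 True False
```

The duration case hints at why such tasks are harder for LLMs than time zone shifts: it requires summing variable month lengths and accounting for the leap day, rather than adding a fixed offset.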
The authors conclude that ToT provides a more comprehensive and controlled assessment of LLMs' temporal reasoning abilities compared to existing benchmarks. By open-sourcing the dataset and evaluation framework, the authors aim to foster further research and development in this area. The study highlights the importance of considering both semantic and arithmetic aspects of temporal reasoning and underscores the need for more diverse and realistic benchmarks to evaluate LLM capabilities accurately.