TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
By: Shaohang Wei, Wei Li, Feifan Song and more
Potential Business Impact:
Helps computers understand time and events better.
Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets, each reflecting a different real-world challenge: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning and non-reasoning models, provide an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME
Similar Papers
Time-R1: Towards Comprehensive Temporal Reasoning in LLMs
Computation and Language
AI can now understand and imagine future events.
LexTime: A Benchmark for Temporal Ordering of Legal Events
Computation and Language
Helps computers understand the order of events in laws.
MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
Computation and Language
Helps computers understand time and events better.