TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval
By: Abdelrahman Abdallah, Mohammed Ali, Muhammad Abdul-Mageed, and others
Existing temporal QA benchmarks focus on simple fact-seeking queries from news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. However, real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. TEMPO features: (1) 1,730 complex queries requiring deep temporal reasoning, such as tracking changes, identifying trends, or comparing cross-period evidence; (2) step-wise retrieval planning with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics, including Temporal Coverage@k and Temporal Precision@k, which measure whether results span the required time periods. Evaluation of 12 retrieval systems reveals substantial challenges: the best model (DiVeR) achieves only 32.0 NDCG@10 and 71.4% Temporal Coverage@10, demonstrating the difficulty of retrieving temporally complete evidence. We believe TEMPO provides a challenging benchmark for improving temporal reasoning in retrieval and RAG systems. Our code and data are available at https://github.com/tempo-bench/Tempo. See also our official website: https://tempo-bench.github.io/.
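The abstract does not give formulas for the two temporal metrics, but a plausible reading is: Temporal Coverage@k as the fraction of a query's required time periods covered by the top-k retrieved documents, and Temporal Precision@k as the fraction of top-k documents whose period matches a required one. The sketch below follows that reading; the function names, period labels, and exact definitions are assumptions, not the paper's specification.

```python
def temporal_coverage_at_k(ranked_ids, doc_periods, required_periods, k=10):
    """Fraction of the query's required time periods that appear among the
    periods of the top-k retrieved documents (hedged sketch, not the
    paper's exact definition)."""
    covered = set()
    for doc_id in ranked_ids[:k]:
        covered |= doc_periods.get(doc_id, set())
    if not required_periods:
        return 1.0  # vacuously covered when no periods are required
    return len(covered & required_periods) / len(required_periods)


def temporal_precision_at_k(ranked_ids, doc_periods, required_periods, k=10):
    """Fraction of top-k documents that match at least one required time
    period (again, an assumed formulation)."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for d in top if doc_periods.get(d, set()) & required_periods)
    return hits / len(top)


# Toy example: a query requiring evidence from three decades.
required = {"1990s", "2000s", "2010s"}
doc_periods = {"d1": {"1990s"}, "d2": {"2010s"}, "d3": {"1980s"}}
ranked = ["d1", "d2", "d3"]

coverage = temporal_coverage_at_k(ranked, doc_periods, required, k=3)   # 2/3
precision = temporal_precision_at_k(ranked, doc_periods, required, k=3)  # 2/3
```

Under this formulation, coverage penalizes rankings whose top results cluster in one era, while precision penalizes off-period documents, mirroring the abstract's notion of "temporally complete evidence."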
Similar Papers
TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
Artificial Intelligence
Helps computers understand time and events better.
RECOR: Reasoning-focused Multi-turn Conversational Retrieval Benchmark
Information Retrieval
Helps computers answer questions by talking and thinking.
Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
Artificial Intelligence
Tests how well computers can think step-by-step.