OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Published: August 12, 2025 | arXiv ID: 2508.09124v1

By: Weixuan Wang, Dongge Han, Daniel Madrigal Diaz and more

Potential Business Impact:

Tests smart computer helpers on office tasks.

Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires an agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts than existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.

LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Software Engineering

Tests AI's ability to write complex computer code.

17 Nov 2025 0

89%

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

Artificial Intelligence

Tests computers on planning complex game quests.

18 Aug 2025 1

88%

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

Artificial Intelligence

Tests AI's ability to do real science research.

31 Oct 2025 0

Country of Origin

🇬🇧 United Kingdom

Repos / Data Links

github.com

Page Count

21 pages
