DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
By: Yinger Zhang, Shutong Jiang, Renhao Li, and more
Potential Business Impact:
Helps AI plan trips and shop better.
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., over time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
Similar Papers
SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
Artificial Intelligence
Computers struggle to plan many steps ahead.
AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
Artificial Intelligence
Helps robots plan space missions better.
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
Artificial Intelligence
Tests computers on planning complex game quests.