TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning
By: Xiang Cheng, Yulan Hu, Xiangwen Zhang, and others
Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction; because they cannot support dynamic user-agent interaction, they fail to comprehensively assess agent capabilities. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools, providing deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and analyze their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.
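The abstract does not describe the sandbox's actual API, but the core idea (deterministic tool outputs for reproducible agent evaluation) can be sketched with a mock travel tool whose results depend only on its arguments, never on wall-clock time or a live backend. All names and schemas below are illustrative assumptions, not details from the paper:

```python
import hashlib

def deterministic_flight_search(origin: str, destination: str, date: str) -> list[dict]:
    """Hypothetical sandbox tool: always returns the same flight list
    for the same arguments, so agent runs are reproducible."""
    # Seed derived purely from the arguments, so repeated calls agree.
    key = f"{origin}|{destination}|{date}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    flights = []
    for i in range(3):
        flights.append({
            "flight_no": f"TB{(seed + i) % 9000 + 1000}",
            "origin": origin,
            "destination": destination,
            "date": date,
            "price_usd": 120 + (seed >> (i * 3)) % 300,
        })
    return flights

# Repeated calls with identical arguments yield identical outputs.
a = deterministic_flight_search("PEK", "SFO", "2025-06-01")
b = deterministic_flight_search("PEK", "SFO", "2025-06-01")
assert a == b
```

Under this design, any variation across evaluation runs is attributable to the agent rather than to the environment, which is what makes cross-model comparisons on the benchmark stable.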
Similar Papers
TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
Artificial Intelligence
Benchmarks and rewards real-world travel planning with fine-grained evaluation criteria.
CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
Artificial Intelligence
Evaluates whether LLM tool-use agents can plan cost-optimally and adapt across multiple turns in dynamic environments.
Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
Computation and Language
Tests whether language agents can revise travel plans flexibly as constraints change.