DeliveryBench: Can Agents Earn Profit in the Real World?
By: Lingjun Mao, Jiawei Ren, Kun Zhou, and more
LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple, short-term tasks and struggle to capture the rich, realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadlines, transportation expenses, vehicle battery levels, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a range of VLM-based agents across nine cities and compare them with human players. Our results reveal a substantial performance gap relative to humans and show that these agents are short-sighted and frequently violate basic commonsense constraints. We also observe distinct personalities across models (e.g., adventurous GPT-5 vs. conservative Claude), highlighting both the brittleness and the diversity of current VLM-based embodied agents in realistic, constraint-dense environments. Our code, data, and benchmark are available at https://deliverybench.github.io.
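To make the long-horizon objective concrete, the sketch below renders "net profit under constraints" as a minimal Python scoring function. The abstract does not specify DeliveryBench's actual API or reward, so every name here (Order, CourierState, net_profit) and every constant (penalty values) is a hypothetical illustration of how earnings, deadlines, travel expenses, and battery could interact, not the benchmark's real implementation.

```python
from dataclasses import dataclass

# Illustrative sketch only: all classes, fields, and penalty constants are
# assumptions mirroring the constraints listed in the abstract (delivery
# deadlines, transportation expenses, vehicle battery), not DeliveryBench code.

@dataclass
class Order:
    payout: float        # payment earned on a successful, on-time delivery
    deadline: float      # latest acceptable delivery time (seconds)
    delivered_at: float  # actual delivery time (seconds)

@dataclass
class CourierState:
    travel_cost: float   # accumulated transportation expense over the episode
    battery: float       # remaining vehicle battery, in [0, 1]

def net_profit(orders: list[Order], state: CourierState,
               late_penalty: float = 1.0,
               stranded_penalty: float = 10.0) -> float:
    """Hypothetical long-horizon objective: delivery earnings minus costs,
    with late deliveries forfeiting the payout and incurring a penalty."""
    earnings = 0.0
    for order in orders:
        if order.delivered_at <= order.deadline:
            earnings += order.payout
        else:
            earnings -= late_penalty       # assumed cost of a missed deadline
    if state.battery <= 0.0:
        earnings -= stranded_penalty       # assumed cost of a drained vehicle
    return earnings - state.travel_cost
```

Under a formulation like this, an agent that greedily accepts every order can end up net-negative once late penalties and travel expenses accrue, which is exactly the kind of short-sighted behavior the benchmark is designed to surface.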
Similar Papers
InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
Artificial Intelligence
Tests AI's ability to do real science research.
VisualActBench: Can VLMs See and Act like a Human?
CV and Pattern Recognition
Teaches computers to act smartly by just watching.