CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
By: Siqi Wang, Chao Liang, Yunfan Gao, and more
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze them further, we investigate a set of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (collectively, BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required to tackle "last-mile" navigation challenges.
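The abstract names the three BCR strategies but does not spell out their mechanics. The sketch below is one possible way an observation-reasoning loop with backtracking and memory-based retrieval could be wired together; it is an illustration under our own assumptions, not the paper's implementation, and every identifier in it (MemoryBank, navigate, vlm_decide, observe) is a hypothetical placeholder.

```python
# Illustrative sketch only, not the authors' code. All names are hypothetical.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Step:
    node_id: str   # street-view node the agent stood at
    action: str    # move the VLM chose there, e.g. "forward", "turn_left"


@dataclass
class MemoryBank:
    """Memory-Based Retrieval: recall past episodes relevant to the current need."""
    episodes: list = field(default_factory=list)

    def retrieve(self, need: str, k: int = 3) -> list:
        # Toy relevance score: word overlap between the implicit need and stored notes.
        return sorted(
            self.episodes,
            key=lambda ep: len(set(need.split()) & set(ep["note"].split())),
            reverse=True,
        )[:k]


def navigate(need, start, vlm_decide, observe, memory, max_steps=30):
    """Iterate observe -> reason -> act, backtracking when confidence collapses."""
    path, history = [start], deque()
    for _ in range(max_steps):
        # Enriching Spatial Cognition: `observe` is assumed to return a scene
        # description already augmented with landmarks and relative directions.
        obs = observe(path[-1])
        action, confidence = vlm_decide(need, obs, list(history), memory.retrieve(need))

        if action == "stop":                    # VLM believes the need can be met here
            break
        if confidence < 0.3 and history:        # Backtracking Mechanism: undo the last
            path.append(history.pop().node_id)  # move instead of compounding errors
            continue

        history.append(Step(path[-1], action))
        path.append(f"{path[-1]}|{action}")     # placeholder transition model

    memory.episodes.append({"note": need, "path": path})  # store for future retrieval
    return path


# Minimal usage with stubbed components (a real setup would query an actual VLM).
memory = MemoryBank()
route = navigate(
    need="I am thirsty",
    start="node_0",
    vlm_decide=lambda need, obs, hist, recalled: ("forward", 0.9) if len(hist) < 3 else ("stop", 0.9),
    observe=lambda node: f"street view at {node}: cafe sign ahead, crosswalk to the left",
    memory=memory,
)
print(route)
```

In this toy framing, backtracking is triggered by a low decision confidence, and the memory bank simply stores and re-ranks past episodes; the paper's actual criteria for backtracking, spatial enrichment, and retrieval are not given in the abstract.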
Similar Papers
How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
CV and Pattern Recognition
Helps computers understand city streets better.
ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models
Robotics
Robots learn to explore and do tasks better.
Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning
Robotics
Helps robots explore new places much faster.