Score: 1

GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation

Published: April 13, 2025 | arXiv ID: 2504.09587v3

By: Haotian Xu , Yue Hu , Chen Gao and more

Potential Business Impact:

Drones find places using words and maps.

Business Areas:

Geospatial Data and Analytics, Navigation and Mapping

Language-goal aerial navigation is a critical challenge in embodied AI, requiring UAVs to localize targets in complex environments such as urban blocks based on textual specification. Existing methods, often adapted from indoor navigation, struggle to scale due to limited field of view, semantic ambiguity among objects, and lack of structured spatial reasoning. In this work, we propose GeoNav, a geospatially aware multimodal agent to enable long-range navigation. GeoNav operates in three phases-landmark navigation, target search, and precise localization-mimicking human coarse-to-fine spatial strategies. To support such reasoning, it dynamically builds two different types of spatial memory. The first is a global but schematic cognitive map, which fuses prior textual geographic knowledge and embodied visual cues into a top-down, annotated form for fast navigation to the landmark region. The second is a local but delicate scene graph representing hierarchical spatial relationships between blocks, landmarks, and objects, which is used for definite target localization. On top of this structured representation, GeoNav employs a spatially aware, multimodal chain-of-thought prompting mechanism to enable multimodal large language models with efficient and interpretable decision-making across stages. On the CityNav urban navigation benchmark, GeoNav surpasses the current state-of-the-art by up to 12.53% in success rate and significantly improves navigation efficiency, even in hard-level tasks. Ablation studies highlight the importance of each module, showcasing how geospatial representations and coarse-to-fine reasoning enhance UAV navigation.

CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

Robotics

Drones follow spoken directions to fly in cities.

8 May 2025 1

91%

Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation

Artificial Intelligence

Helps robots see and move without getting lost.

9 Apr 2025 1

91%

MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

CV and Pattern Recognition

Helps robots follow directions and remember places.

20 Aug 2025 1

View PDF Login to Bookmark

Country of Origin

🇨🇳 China

Page Count

14 pages

GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation

Drones find places using words and maps.

Technical Abstract

CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation

MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning