CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

Published: May 8, 2025 | arXiv ID: 2505.05622v1

By: Weichen Zhang, Chen Gao, Shiquan Yu and more

Potential Business Impact:

Drones follow spoken directions to fly in cities.

Business Areas:

Drone Management Hardware, Software

Aerial vision-and-language navigation (VLN), which requires drones to interpret natural language instructions and navigate complex urban environments, is emerging as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents have achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals with different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capacities of the LLM. Additionally, a global memory module that stores historical trajectories in a topological graph is developed to simplify navigation for visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of the different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at https://github.com/VinceOuti/CityNavAgent.
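
To illustrate the global memory idea in the abstract, the following is a minimal sketch of storing flown trajectories in a topological graph and reusing them to reach an already-visited target. It is not the authors' implementation: the class name TopologicalMemory, its methods, the string waypoint labels, and the use of breadth-first search are all assumptions made for illustration.

```python
from collections import deque


class TopologicalMemory:
    """Hypothetical global memory: waypoints are nodes, flown segments are edges."""

    def __init__(self):
        # Adjacency sets: waypoint -> neighbouring waypoints (assumed undirected).
        self.edges = {}

    def add_trajectory(self, waypoints):
        """Merge one flown trajectory (a list of hashable waypoints) into the graph."""
        for node in waypoints:
            self.edges.setdefault(node, set())
        for a, b in zip(waypoints, waypoints[1:]):
            self.edges[a].add(b)
            self.edges[b].add(a)

    def path_to(self, start, goal):
        """Return a fewest-hop path via breadth-first search, or None if none is known."""
        if start not in self.edges or goal not in self.edges:
            return None
        parents = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                # Walk the parent links back to the start, then reverse.
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in self.edges[node]:
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None


memory = TopologicalMemory()
memory.add_trajectory(["start", "crossroad", "tower"])
memory.add_trajectory(["crossroad", "park"])
route = memory.path_to("park", "tower")
print(route)
```

In this sketch, a target that appears in an earlier trajectory can be reached by graph search over stored waypoints rather than by fresh exploration, which is the simplification the abstract attributes to the memory module.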

Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction

CV and Pattern Recognition

Drones fly better following spoken directions.

14 Mar 2025

92%

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

CV and Pattern Recognition

Drones fly themselves using only cameras and words.

9 Dec 2025

92%

GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation

Robotics

Drones find places using words and maps.

13 Apr 2025

Country of Origin

🇨🇳 China

Page Count

17 pages
