History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

Published: December 16, 2025 | arXiv ID: 2512.14222v1

By: Xichen Ding, Jianzhe Gao, Cong Pan and more

Potential Business Impact:

Drones find places better using maps and memories.

Business Areas:

Navigation, Navigation and Mapping

Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments from linguistic instructions. Successful navigation demands both global environmental reasoning and local scene comprehension, but existing UAV agents typically adopt mono-granularity frameworks that struggle to balance the two. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates both aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. A historical grid map dynamically aggregates visual features into a structured spatial memory, enhancing comprehensive scene awareness. The CityNav dataset annotations are also manually refined to improve data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, and extensive ablation studies further verify the effectiveness of each component.
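
To make the idea of a historical grid map more concrete, here is a minimal sketch of a spatial memory that aggregates per-step feature vectors into grid cells. It is not the HETT implementation: the abstract does not say how features are aggregated, so the running mean, the 2D cell layout, and every name below (`HistoryGridMap`, `cell_size`, `update`, `read`) are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


class HistoryGridMap:
    """Toy spatial memory: running mean of feature vectors per grid cell.

    Hypothetical illustration only; the aggregation rule is an assumption.
    """

    def __init__(self, cell_size: float, feat_dim: int) -> None:
        if cell_size <= 0:
            raise ValueError("cell_size must be positive")
        self.cell_size = cell_size
        self.feat_dim = feat_dim
        self._sums: Dict[Cell, List[float]] = {}
        self._counts: Dict[Cell, int] = {}

    def cell_of(self, x: float, y: float) -> Cell:
        # Floor division maps world coordinates to integer cell indices.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, x: float, y: float, feature: List[float]) -> None:
        # Accumulate a feature observed at world position (x, y).
        if len(feature) != self.feat_dim:
            raise ValueError("feature has the wrong dimension")
        cell = self.cell_of(x, y)
        sums = self._sums.setdefault(cell, [0.0] * self.feat_dim)
        for i, value in enumerate(feature):
            sums[i] += value
        self._counts[cell] = self._counts.get(cell, 0) + 1

    def read(self, x: float, y: float) -> List[float]:
        # Return the mean feature of the cell, or zeros if never visited.
        cell = self.cell_of(x, y)
        count = self._counts.get(cell, 0)
        if count == 0:
            return [0.0] * self.feat_dim
        return [s / count for s in self._sums[cell]]
```

In this sketch, an agent would call `update` at each step with its current position and a visual feature, then call `read` on nearby cells to recover what it has already seen there. A learned model would likely replace the running mean with a trainable fusion step.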

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

CV and Pattern Recognition

Drones fly themselves using only cameras and words.

9 Dec 2025 1

89%

Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction

CV and Pattern Recognition

Drones fly better following spoken directions.

14 Mar 2025 1

89%

CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

Robotics

Drones follow spoken directions to fly in cities.

8 May 2025 1

Repos / Data Links

github.com

Page Count

9 pages
