Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System
By: Haokun Liu, Zhaoqi Ma, Yunong Li, and more
Potential Business Impact:
Aerial and ground robots use AI to work together to move and arrange objects.
Heterogeneous multi-robot systems show great potential in complex tasks requiring hybrid cooperation. However, traditional approaches relying on static models often struggle with task diversity and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a GridMask-enhanced fine-tuned Vision Language Model (VLM). The LLM decomposes tasks and constructs a global semantic map, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning. Within this framework, the aerial robot follows an optimized global semantic path and continuously provides bird's-eye-view images, guiding the ground robot's local semantic navigation and manipulation, including target-absent scenarios where implicit alignment is maintained. Experiments on real-world cube and object arrangement tasks demonstrate the framework's adaptability and robustness in dynamic environments. To the best of our knowledge, this is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.
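To make the hierarchy concrete, here is a minimal sketch of how the two levels could be wired together: an LLM stage that decomposes the instruction and maintains a global semantic map, and a VLM stage that returns task-specified labels with 2D positions from each aerial image. All names (TaskPlanner-style helpers such as llm_decompose, vlm_detect, SemanticMap) are hypothetical stand-ins, not the authors' implementation, and the LLM/VLM calls are stubbed so the skeleton runs on its own.

from dataclasses import dataclass, field


@dataclass
class Detection:
    """A task-specified semantic label with its 2D position in the aerial image."""
    label: str
    xy: tuple  # (x, y) coordinates in the bird's-eye-view frame


@dataclass
class SemanticMap:
    """Global semantic map that accumulates detections over time."""
    landmarks: dict = field(default_factory=dict)  # label -> 2D position

    def update(self, detections, aerial_pose=None):
        # A real system would project image-frame detections into the world
        # frame using the aerial robot's pose; here we store them directly.
        for d in detections:
            self.landmarks[d.label] = d.xy


def llm_decompose(instruction: str) -> list:
    """Stub for the prompted LLM: break a task into ordered subtasks."""
    # A real system would prompt an LLM; this fixed decomposition is illustrative.
    return ["locate red cube", "navigate to red cube",
            "grasp red cube", "place red cube at goal"]


def vlm_detect(aerial_image, target_labels: list) -> list:
    """Stub for the GridMask-enhanced VLM: return labels with 2D positions."""
    # A real system would run the fine-tuned VLM on the aerial image.
    return [Detection(label=l, xy=(0.5, 0.5)) for l in target_labels]


def run_episode(instruction: str, aerial_image=None):
    subtasks = llm_decompose(instruction)      # high-level reasoning (LLM)
    semantic_map = SemanticMap()
    for subtask in subtasks:
        detections = vlm_detect(aerial_image, ["red cube"])  # local perception (VLM)
        semantic_map.update(detections)
        # The ground robot would consume the map and detections for local
        # navigation and manipulation; the aerial robot would replan its
        # global semantic path at this point.
        print(subtask, "->", semantic_map.landmarks)


if __name__ == "__main__":
    run_episode("arrange the red cube at the goal position")

In this sketch the ground robot never queries the LLM directly; it only acts on the VLM detections and the global map, which mirrors the paper's split between global reasoning and local execution.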
Similar Papers
AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models
Robotics
Lets drones perform manipulation tasks safely using language instructions.
Air-Ground Collaboration for Language-Specified Missions in Unknown Environments
Robotics
Air and ground robots follow language-specified missions together in unknown environments.
General-Purpose Aerial Intelligent Agents Empowered by Large Language Models
Robotics
Drones can now figure out new jobs on their own.