WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control

Published: December 11, 2025 | arXiv ID: 2512.11047v2

By: Haoran Jiang, Jin Chen, Qingwen Bu and more

Potential Business Impact:

Robots can now reach and grab things anywhere.

Business Areas:

Autonomous Vehicles, Transportation

Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, whether modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace and prevents it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost, action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of a kind in enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.

92%

ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

Robotics

Robots learn to build things by watching goals.

1 Dec 2025 1

91%

LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

Robotics

Robots learn to do many new tasks by watching.

16 Jun 2025 0

Page Count

23 pages
