Igniting VLMs toward the Embodied Space
By: Andy Zhai, Brae Liu, Bruno Fang, and more
Potential Business Impact:
Robots learn to understand and do tasks from words.
While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI. We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level CoT, seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction-following, complex understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
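To make the "single differentiable framework" idea concrete, here is a minimal PyTorch sketch (not the authors' released code) of a shared trunk that emits both language-level CoT/subgoal tokens and continuous action chunks. The class and head names, dimensions, and chunk size (UnifiedCrossLevelPolicy, lm_head, action_head) are illustrative assumptions, not WALL-OSS internals.

```python
# Minimal sketch: one multimodal trunk feeds both a language head (instruction
# reasoning / subgoal tokens) and an action head (fine-grained action chunks),
# so all levels sit in one differentiable graph. Names are hypothetical.
import torch
import torch.nn as nn

class UnifiedCrossLevelPolicy(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, action_dim=7, chunk=16):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))  # stand-in image encoder
        self.embed = nn.Embedding(vocab_size, d_model)                     # instruction/CoT token embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(enc_layer, num_layers=2)        # shared multimodal trunk
        self.lm_head = nn.Linear(d_model, vocab_size)                      # reasoning + subgoal tokens
        self.action_head = nn.Linear(d_model, action_dim * chunk)          # continuous action chunk
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, image, instruction_ids):
        img_tok = self.vision(image).unsqueeze(1)              # (B, 1, d)
        txt_tok = self.embed(instruction_ids)                  # (B, T, d)
        h = self.trunk(torch.cat([img_tok, txt_tok], dim=1))   # joint visual-language context
        cot_logits = self.lm_head(h[:, 1:])                    # language-level CoT / subgoals
        actions = self.action_head(h[:, 0]).view(-1, self.chunk, self.action_dim)
        return cot_logits, actions                             # both outputs from one graph

model = UnifiedCrossLevelPolicy()
img = torch.randn(2, 3, 64, 64)
ids = torch.randint(0, 32000, (2, 12))
cot_logits, actions = model(img, ids)
print(cot_logits.shape, actions.shape)  # (2, 12, 32000), (2, 16, 7)
```

Under this sketch, a CoT token loss and an action regression loss would backpropagate through the same trunk, so instruction reasoning, subgoal decomposition, and action synthesis are optimized jointly rather than in separate stages, which is the coupling the abstract describes.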
Similar Papers
iFlyBot-VLM Technical Report
Robotics
Robots learn to move and act by seeing.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Robotics
Robots learn to do tasks by watching and listening.
10 Open Challenges Steering the Future of Vision-Language-Action Models
Robotics
Robots learn to follow spoken commands and act.