Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning
By: Xiuxiu Qi, Yu Yang, Jiannong Cao, and more
Potential Business Impact:
Robots learn to do tasks smoothly from watching humans.
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representations, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations via bidirectional cross-attention, learning contextual information for action generation and thereby resolving semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to a 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
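The bidirectional cross-attention described above can be pictured with a minimal PyTorch sketch: language tokens attend to visuomotor (vision plus proprioception) tokens to gather physical context, while visuomotor tokens attend to language tokens to gather semantic context. The module names, dimensions, and residual/LayerNorm fusion below are illustrative assumptions, not the paper's actual CCoL implementation.

```python
# Minimal sketch of bidirectional cross-attention between language tokens
# and visuomotor tokens. All names and dimensions are assumptions for
# illustration; the real CCoL architecture may differ.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Two cross-attention directions: language -> visuomotor and back.
        self.lang_to_vm = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vm_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_lang = nn.LayerNorm(dim)
        self.norm_vm = nn.LayerNorm(dim)

    def forward(self, lang: torch.Tensor, vm: torch.Tensor):
        # lang: (B, L_lang, dim) language token embeddings
        # vm:   (B, L_vm, dim)  fused vision + proprioception embeddings
        lang_ctx, _ = self.lang_to_vm(query=lang, key=vm, value=vm)
        vm_ctx, _ = self.vm_to_lang(query=vm, key=lang, value=lang)
        # Residual connections preserve each modality's original content.
        return self.norm_lang(lang + lang_ctx), self.norm_vm(vm + vm_ctx)

# Usage: ground a 12-token instruction in 50 visuomotor tokens.
block = BidirectionalCrossAttention()
lang_out, vm_out = block(torch.randn(2, 12, 256), torch.randn(2, 50, 256))
print(lang_out.shape, vm_out.shape)  # (2, 12, 256) (2, 50, 256)
```

In a BC pipeline of this kind, the grounded visuomotor tokens would then condition the action head, so each predicted action step is informed by instruction semantics rather than by vision alone.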
Similar Papers
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Robotics
Teaches robots to follow spoken commands precisely.
Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics
Robotics
Robots change how they move based on your words.
Continually Evolving Skill Knowledge in Vision Language Action Model
Robotics
Robots learn new skills without constant retraining.