Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
By: Xiangkai Ma, Lekai Xing, Han Zhang, and more
Potential Business Impact:
Robots learn to do tasks by watching and thinking.
Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their strong perceptual comprehension. Because text-only CoT struggles to adequately capture scene details in complex spatial environments, a promising recent strategy leverages visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual CoT: autoregressively generated tokens are simultaneously decoded into future-frame predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments in simulated and real-world environments demonstrate state-of-the-art performance. VITA improves over existing baselines by 14.5%, 9.6%, and 12.1% on CALVIN, LIBERO, and SimplerEnv, respectively. Furthermore, VITA attains an average success rate of 80.5% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.
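To make the core idea concrete, below is a minimal PyTorch-style sketch of a shared discrete latent space whose autoregressive tokens are decoded into both future-frame predictions and robot actions. This is not the authors' implementation: the class name `SharedLatentVLA`, the head names, all dimensions, and the simple summed MSE objective are illustrative assumptions used only to show how one token stream can supervise both outputs.

```python
# Illustrative sketch (assumptions, not the VITA codebase): one autoregressive
# token stream in a shared discrete latent space is decoded into (a) future-frame
# predictions and (b) low-level robot actions, so visual dynamics can act as an
# inductive bias for motion planning.

import torch
import torch.nn as nn


class SharedLatentVLA(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_layers=4,
                 action_dim=7, frame_patch_dim=3 * 16 * 16):
        super().__init__()
        # Shared codebook: vision and action tokens live in the same discrete space.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)           # next-token prediction
        self.frame_head = nn.Linear(d_model, frame_patch_dim)   # future-frame decoding
        self.action_head = nn.Linear(d_model, action_dim)       # robot-action decoding

    def forward(self, tokens):
        # tokens: (batch, seq) indices into the shared discrete latent space
        h = self.token_emb(tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(h, mask=causal_mask)   # autoregressive (causal) context
        return self.lm_head(h), self.frame_head(h), self.action_head(h)


model = SharedLatentVLA()
tokens = torch.randint(0, 1024, (2, 32))         # dummy observation/instruction tokens
_, frame_pred, action_pred = model(tokens)

# Joint objective: the same latent tokens are supervised by both future frames
# and actions (dummy targets here), coupling perception and motor control.
frame_target = torch.randn_like(frame_pred)
action_target = torch.randn_like(action_pred)
loss = (nn.functional.mse_loss(frame_pred, frame_target)
        + nn.functional.mse_loss(action_pred, action_target))
loss.backward()
print(float(loss))
```

In this sketch, the single causal backbone plays the role of the implicit visual CoT: rather than emitting textual reasoning, the intermediate tokens are decoded directly into predicted future frames alongside actions.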
Similar Papers
GraphCoT-VLA: A 3D Spatial-Aware Reasoning Vision-Language-Action Model for Robotic Manipulation with Ambiguous Instructions
Robotics
Robots understand confusing orders and see in 3D.
ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Robotics
Robots learn to build things by watching goals.
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
CV and Pattern Recognition
Teaches robots to do tasks using sight and words.