
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Published: November 27, 2025 | arXiv ID: 2511.22134v1

By: Zhen Fang, Zhuoyang Liu, Jiaming Liu, and more

Potential Business Impact:

Teaches robots to act and think better.

Business Areas:

Autonomous Vehicles, Transportation

To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then to fine-tune it on annotated robot data mixed with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared with the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 on SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a better balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.
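The abstract names a dual-teacher distillation strategy that "assigns different supervision signals to different data domains" but gives no details. The following is only a minimal sketch of that general idea (routing each training sample to a domain-specific teacher), not the authors' method. Everything in it is a hypothetical placeholder: the teacher names `action_teacher` and `reasoning_teacher`, the `"robot"` domain label, the forward-KL loss, and the fixed `domain_weights`. The paper's adaptive weighting is not reproduced.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Forward KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def route_teacher(domain: str) -> str:
    """Hypothetical routing rule: robot-action samples go to the action teacher,
    all other samples (multimodal / reasoning data) go to the reasoning teacher."""
    return "action_teacher" if domain == "robot" else "reasoning_teacher"


def dual_teacher_loss(batch: List[Dict], domain_weights: Dict[str, float]) -> float:
    """Mean distillation loss in which each sample is supervised only by the
    teacher assigned to its data domain.

    Each sample dict (hypothetical layout) holds 'domain', 'student'
    (a probability vector) and one probability vector per teacher name.
    Fixed per-domain weights stand in for the paper's adaptive weighting.
    """
    total = 0.0
    for sample in batch:
        teacher = route_teacher(sample["domain"])
        weight = domain_weights.get(sample["domain"], 1.0)  # default weight 1.0
        # Distill the student toward the routed teacher only: KL(teacher || student).
        total += weight * kl_divergence(sample[teacher], sample["student"])
    return total / len(batch)
```

For example, a robot-domain sample whose student distribution already matches the action teacher contributes zero loss even if it disagrees with the reasoning teacher, which is the point of keeping the two supervision signals separate.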

Similar Papers:

Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars drive smarter and faster.

25 Nov 2025

OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

Robotics

Robots learn to plan, fix mistakes, and interact.

17 May 2025

DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

CV and Pattern Recognition

Helps robots understand where things are better.

15 Oct 2025

Page Count:

20 pages
