
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Published: July 6, 2025 | arXiv ID: 2507.04447v3

By: Wenyao Zhang, Hongsi Liu, Zekun Qi, and more

Potential Business Impact:

Robots learn to do tasks by watching and thinking.

Business Areas:

Autonomous Vehicles, Transportation

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods rely on image-based forecasting, which is challenging, carries redundant information, and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which together provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world: they first form abstract multimodal reasoning chains and then act. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulated environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
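The block-wise structured attention described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the group names, token counts, and the helper functions `blockwise_mask` and `masked_softmax` are hypothetical, and the choice to leave every other group pair unmasked is an assumption made only for illustration. It shows one way to stop dynamic, spatial, and semantic tokens from attending to each other while each group can still see itself and a shared context.

```python
import numpy as np

def blockwise_mask(group_sizes, blocked_pairs):
    """Build a boolean (N, N) mask; True means the query row may attend to the key column.

    group_sizes: dict mapping a (hypothetical) group name to its token count, in sequence order.
    blocked_pairs: iterable of (query_group, key_group) names whose attention is masked out.
    """
    names = list(group_sizes)
    # Label every token position with the index of the group it belongs to.
    labels = np.concatenate([np.full(group_sizes[n], i) for i, n in enumerate(names)])
    allowed = np.ones((labels.size, labels.size), dtype=bool)
    for q, k in blocked_pairs:
        qi, ki = names.index(q), names.index(k)
        # Disable the whole (query group, key group) block at once.
        allowed[np.ix_(labels == qi, labels == ki)] = False
    return allowed

def masked_softmax(scores, allowed):
    """Row-wise softmax that gives exactly zero weight to masked positions."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Assumed layout: 4 shared context tokens followed by 2 tokens per knowledge type.
sizes = {"context": 4, "dynamic": 2, "spatial": 2, "semantic": 2}
knowledge = ["dynamic", "spatial", "semantic"]
# Mask mutual attention among the three knowledge groups only.
blocked = [(a, b) for a in knowledge for b in knowledge if a != b]

mask = blockwise_mask(sizes, blocked)
rng = np.random.default_rng(0)
weights = masked_softmax(rng.normal(size=mask.shape), mask)
```

In this sketch each knowledge group keeps access to the context tokens and to itself, so every row has at least one allowed key and the softmax stays well defined. In a framework such as PyTorch, a boolean mask of this shape would typically be passed to the attention call rather than applied by hand.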

CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human

Robotics

Helps robots learn to help people better.

18 Sep 2025

93%

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

CV and Pattern Recognition

Helps self-driving cars plan safer, faster trips.

16 Jun 2025

93%

LLaDA-VLA: Vision Language Diffusion Action Models

Robotics

Robots learn to do tasks by watching and reading.

8 Sep 2025


Page Count

30 pages
