Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
By: John Won, Kyungmin Lee, Huiwon Jang, and more
Potential Business Impact:
Teaches robots to do tasks by watching videos.
Recently, augmenting vision-language-action models (VLAs) with world models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent differences between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that resolves the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enable the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, building on the decoupled training framework, we introduce a sampling method that denoises action and vision tokens asynchronously at different rates, yielding further improvements through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain in success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST leads to significant gains when transferred to the RoboCasa benchmark.
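The decoupled training recipe described in the abstract (independent noise perturbations per modality plus a per-modality flow matching loss) can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: dual_stream_net and its call signature are assumptions, and the interpolation and velocity-target conventions follow standard flow matching rather than any details confirmed by the paper.

```python
# Minimal sketch (assumed, not the authors' code) of decoupled flow matching
# with independent noise perturbations for action and vision latents.
import torch
import torch.nn.functional as F

def decoupled_flow_matching_loss(dual_stream_net, action_latents, vision_latents):
    """Perturb each modality with its own timestep and noise, then regress
    per-modality flow-matching velocity targets (no shared latent space)."""
    b = action_latents.shape[0]

    # Independent timesteps and Gaussian noise for each modality.
    t_act = torch.rand(b, device=action_latents.device).view(b, 1, 1)
    t_vis = torch.rand(b, device=vision_latents.device).view(b, 1, 1)
    noise_act = torch.randn_like(action_latents)
    noise_vis = torch.randn_like(vision_latents)

    # Linear interpolation paths x_t = (1 - t) * noise + t * data.
    x_act = (1 - t_act) * noise_act + t_act * action_latents
    x_vis = (1 - t_vis) * noise_vis + t_vis * vision_latents

    # Standard flow-matching velocity targets v = data - noise.
    v_act_target = action_latents - noise_act
    v_vis_target = vision_latents - noise_vis

    # Hypothetical dual-stream transformer: predicts one velocity per stream
    # while sharing information across streams internally.
    v_act_pred, v_vis_pred = dual_stream_net(x_act, x_vis, t_act, t_vis)

    # Decoupled loss: each modality is supervised separately.
    return F.mse_loss(v_act_pred, v_act_target) + F.mse_loss(v_vis_pred, v_vis_target)
```

Because each stream carries its own noise level and loss term in this sketch, action and vision tokens could later be denoised asynchronously at different rates, which is the property the abstract's inference-time scaling method exploits.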
Similar Papers
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
Robotics
Robots learn to do tasks by watching and understanding.
LLaDA-VLA: Vision Language Diffusion Action Models
Robotics
Robots learn to do tasks by watching and reading.