
ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

Published: April 23, 2025 | arXiv ID: 2504.16464v1

By: Ying Li, Xiaobao Wei, Xiaowei Chi and more

Potential Business Impact:

Robots follow instructions better, making videos look real.

Business Areas:

Robotics Hardware, Science and Engineering, Software

While recent advances in robotic manipulation video synthesis have shown promise, significant challenges remain in ensuring effective instruction-following and achieving high visual quality. Recent methods such as RoboDreamer use linguistic decomposition to divide instructions into separate lower-level primitives and condition the world model on these primitives to achieve compositional instruction-following. However, these separate primitives do not account for the relationships between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both of which are crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on an action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as an action tree and assign embeddings to the tree nodes; each instruction acquires its embeddings by navigating through the action tree. The instruction embeddings are then used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts instruction-following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks show that ManipDreamer achieves large improvements in video quality metrics on both seen and unseen tasks: compared to the recent RoboDreamer model, on unseen tasks PSNR improves from 19.55 to 21.05, SSIM improves from 0.7474 to 0.7982, and Flow Error drops from 3.506 to 3.201. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% on average across 6 RLBench tasks.
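To make the action-tree idea in the abstract more concrete, here is a minimal sketch, not the paper's implementation. It assumes that an instruction has already been decomposed into an ordered list of primitives, that each tree node holds one embedding vector, and that an instruction's embedding is the sum of the node embeddings along its path. The class names, the random initialization, and the sum aggregation are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import random


class ActionTreeNode:
    """One node of a hypothetical action tree: an embedding plus children keyed by primitive."""

    def __init__(self, dim, rng):
        # Random initialization stands in for a learned embedding (assumption).
        self.embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        self.children = {}


class ActionTree:
    """Prefix tree over instruction primitives (illustrative structure only)."""

    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.root = ActionTreeNode(dim, self.rng)

    def embed(self, primitives):
        """Walk the tree along `primitives`, creating missing nodes,
        and return the element-wise sum of the node embeddings on the path.
        Summation is an assumed aggregation, chosen for simplicity."""
        node = self.root
        total = [0.0] * self.dim
        for primitive in primitives:
            if primitive not in node.children:
                node.children[primitive] = ActionTreeNode(self.dim, self.rng)
            node = node.children[primitive]
            total = [t + e for t, e in zip(total, node.embedding)]
        return total


tree = ActionTree(dim=4, seed=0)
pick_red = tree.embed(["pick", "red block"])
pick_blue = tree.embed(["pick", "blue block"])
```

In this sketch, `pick_red` and `pick_blue` both pass through the same `"pick"` node, so they share that node's contribution and differ only in the leaf. This is one simple way that related instructions could end up with related embeddings, in contrast to embedding each primitive independently.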

FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

Robotics

Helps robots learn to move objects by seeing.

15 May 2025

88%

ManipDreamer3D : Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory

Robotics

Robots learn to move objects from pictures and words.

29 Aug 2025

87%

3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

Robotics

Robots learn to move objects by watching how they move.

6 Jun 2025


Country of Origin

🇨🇳 China

Page Count

11 pages
