Score: 1

Image Generation as a Visual Planner for Robotic Manipulation

Published: November 29, 2025 | arXiv ID: 2512.00532v1

By: Ye Pang

Potential Business Impact:

Adapts off-the-shelf image generators to plan robot manipulation as short video-like frame sequences, reducing the need for large robot-specific video datasets.

Business Areas:
Image Recognition, Data and Analytics, Software

Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted with LoRA finetuning. We propose a two-part framework: (1) text-conditioned generation, conditioned on a language instruction and the first frame, and (2) trajectory-conditioned generation, conditioned on a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play, Bridge V2, and RT-1 datasets show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation.
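The core recipe described in the abstract, lightly LoRA-adapting a pretrained image generator so it can be finetuned to emit temporally coherent grid-of-frames images, can be sketched with off-the-shelf libraries. The snippet below is a minimal illustration using diffusers and peft; the base model, LoRA rank, and target module names are assumptions for the sketch, not the authors' reported configuration.

```python
# Minimal sketch: attach LoRA adapters to a pretrained text-to-image model so
# it can be finetuned on "grid" images that tile consecutive robot frames.
# Base model, rank, and target modules are assumptions, not the paper's spec.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base model
    torch_dtype=torch.float16,
).to("cuda")

# Low-rank adapters go on the UNet's attention projections only; the frozen
# base weights preserve the model's pretrained visual and compositional priors.
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet = get_peft_model(pipe.unet, lora_config)

# Training would then minimize the usual denoising loss on grid images, each
# tiling N consecutive robot frames, conditioned on the task instruction
# (text mode) or on a trajectory-overlaid first frame (trajectory mode).
```

Because only the adapter weights are trained, this kind of adaptation is consistent with the paper's claim of "minimal supervision": the temporal coherence comes from the pretrained generator, not from a large robot-video corpus.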
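For the trajectory-conditioned mode, the conditioning input is the first frame with a 2D path drawn on it. A hypothetical helper for constructing that overlay might look like the following; the waypoint format, colors, and line width are illustrative choices, not taken from the paper.

```python
# Hypothetical sketch: render a 2D end-effector path onto the initial frame
# to build the trajectory-conditioned input image.
from PIL import Image, ImageDraw

def overlay_trajectory(first_frame_path, trajectory, out_path="conditioned.png"):
    """Draw a 2D trajectory (list of (x, y) pixel coords) onto the first frame."""
    frame = Image.open(first_frame_path).convert("RGB")
    draw = ImageDraw.Draw(frame)
    # Connect waypoints into a polyline so the planner sees the full path.
    draw.line(trajectory, fill=(255, 0, 0), width=4)
    # Mark start (green) and goal (blue) so the direction of motion is unambiguous.
    for (x, y), color in [(trajectory[0], (0, 255, 0)), (trajectory[-1], (0, 0, 255))]:
        draw.ellipse([x - 6, y - 6, x + 6, y + 6], fill=color)
    frame.save(out_path)
    return frame

# Example: a short reach-and-place path in image coordinates.
# overlay_trajectory("first_frame.png", [(120, 200), (180, 160), (240, 150)])
```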

Country of Origin
🇨🇳 China

Repos / Data Links
https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation

Page Count
11 pages

Category
Computer Science:
CV and Pattern Recognition