The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
By: Hanlin Wang, Hao Ouyang, Qiuyu Wang, and more
Potential Business Impact:
Lets users direct video generation of complex scene events by combining text prompts, motion trajectories, and reference images.
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity. This enables the generation of coherent, controllable events, including multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene layout even across temporary disappearances. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
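To make the multimodal prompt format concrete, here is a minimal sketch of what an event prompt combining a trajectory (with motion, timing, and visibility), a text instruction, and a reference image might look like as a data structure. All names (`TrajectoryPoint`, `EventPrompt`, `visible_span`) are hypothetical illustrations, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrajectoryPoint:
    """One control point: 2D position, timestamp, and visibility flag."""
    x: float
    y: float
    t: float              # normalized time in [0, 1]
    visible: bool = True  # False can encode object entry/exit or occlusion

@dataclass
class EventPrompt:
    """A multimodal event prompt: trajectory + text + optional reference image."""
    trajectory: List[TrajectoryPoint]
    text: str                             # semantic intent in natural language
    reference_image: Optional[str] = None # image grounding object identity

    def visible_span(self) -> float:
        """Fraction of control points at which the object is on screen."""
        if not self.trajectory:
            return 0.0
        return sum(p.visible for p in self.trajectory) / len(self.trajectory)

# Example: an object enters the frame, crosses the scene, and exits.
prompt = EventPrompt(
    trajectory=[
        TrajectoryPoint(0.0, 0.5, 0.0, visible=False),  # off-screen at start
        TrajectoryPoint(0.3, 0.5, 0.4),
        TrajectoryPoint(0.7, 0.5, 0.7),
        TrajectoryPoint(1.0, 0.5, 1.0, visible=False),  # exits the frame
    ],
    text="a golden retriever trots across the lawn",
    reference_image="dog.png",
)
print(prompt.visible_span())  # 0.5
```

Encoding visibility per control point is one plausible way to express the entry/exit and temporary-disappearance events the abstract describes, since the generator can then be conditioned on when an object should be absent from the frame.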