TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
By: Yabo Chen, Yuanzhi Liang, Jiepeng Wang, and more
Potential Business Impact:
AI learns to remember and interact with changing worlds.
World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm: generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that shifts error accumulation from the frame level to the segment level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
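The closed-loop generation-reconstruction-guidance paradigm described above can be sketched as a simple control flow: each generated segment is folded into a persistent 4D memory, and that memory conditions the next segment. The sketch below is a minimal toy illustration of that loop; the class names, the dictionary-based "world memory", and the frame representation are all assumptions for illustration, not TeleWorld's actual implementation.

```python
# Toy sketch of the generation-reconstruction-guidance closed loop.
# All names and data structures here are illustrative assumptions.

class VideoGenerator:
    """Stand-in for the autoregressive diffusion video model."""
    def generate_segment(self, guidance, length=4):
        # Produce `length` new frames conditioned on the current 4D guidance
        # (here just a frame counter and a memory version tag).
        start = guidance["last_frame_id"] + 1
        return [{"frame_id": start + i, "guided_by": guidance["version"]}
                for i in range(length)]

class DynamicSceneReconstructor:
    """Stand-in for the 4D spatio-temporal reconstruction module."""
    def update(self, world_memory, frames):
        # Fold newly generated frames into the persistent 4D representation,
        # so the next generation step is guided by everything seen so far.
        world_memory["frames"].extend(frames)
        world_memory["last_frame_id"] = frames[-1]["frame_id"]
        world_memory["version"] += 1
        return world_memory

def run_closed_loop(num_segments=3, segment_length=4):
    """Generate -> reconstruct -> guide, repeated segment by segment."""
    memory = {"frames": [], "last_frame_id": -1, "version": 0}
    generator = VideoGenerator()
    reconstructor = DynamicSceneReconstructor()
    for _ in range(num_segments):
        guidance = {"last_frame_id": memory["last_frame_id"],
                    "version": memory["version"]}
        segment = generator.generate_segment(guidance, segment_length)
        memory = reconstructor.update(memory, segment)
    return memory
```

Note how planning at the segment level (each loop iteration emits a whole segment, not one frame) mirrors the motivation behind MMPL: errors can accumulate only once per segment rather than once per frame.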
Similar Papers
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
CV and Pattern Recognition
Creates realistic videos with controllable objects and cameras.
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
CV and Pattern Recognition
Makes videos that stay real and make sense.
VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation
CV and Pattern Recognition
Builds realistic worlds that follow rules.