Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes
By: Shuyun Wang, Haiyang Sun, Bing Wang, and more
Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds on a text-to-video diffusion prior, which ensures temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder, restoring detail while preserving the causal structure. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment with fine 2D refinement, yielding better pose alignment and cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage also generalizes to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.
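The decoder-side detail restoration lends itself to a short sketch. The PyTorch snippet below is a minimal illustration, not the authors' implementation: the class name `LatentInjectionBlock`, the helper `encode_frames_2d`, the channel sizes, and the zero-initialized 1x1x1 fusion are all assumptions. It shows one plausible way to add per-frame (temporally agnostic) 2D encoder latents into a 3D causal decoder stage: because the fusion is pointwise in time, no information mixes across frames, which is what keeps the decoder causal.

```python
import torch
import torch.nn as nn


class LatentInjectionBlock(nn.Module):
    """Minimal sketch (not the paper's code): fuse per-frame latents from a
    frozen 2D encoder into one stage of a 3D causal VAE decoder.

    The fusion is pointwise in time (1x1x1 conv + residual add), so no
    information leaks across frames and temporal causality is preserved.
    Zero-initializing the projection leaves the pretrained decoder's
    output unchanged at the start of fine-tuning.
    """

    def __init__(self, dec_channels: int, enc2d_channels: int):
        super().__init__()
        self.proj = nn.Conv3d(enc2d_channels, dec_channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, dec_feat: torch.Tensor, enc2d_feat: torch.Tensor):
        # dec_feat:   (B, C_dec, T, H, W) intermediate 3D decoder features
        # enc2d_feat: (B, C_2d,  T, H, W) per-frame 2D encoder latents,
        #             computed independently per frame and stacked over T
        return dec_feat + self.proj(enc2d_feat)


def encode_frames_2d(encoder2d: nn.Module, video: torch.Tensor):
    """Run a frozen 2D encoder on each frame independently, so the
    resulting latents carry no temporal dependencies.
    video: (B, C, T, H, W) -> latents: (B, C_2d, T, h, w)."""
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    with torch.no_grad():
        lat = encoder2d(frames)                       # (B*T, C_2d, h, w)
    return lat.reshape(b, t, *lat.shape[1:]).permute(0, 2, 1, 3, 4)
```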
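The fine 2D refinement stage can likewise be sketched. The snippet below is an illustrative stand-in, assuming the coarse 3D stage has already placed the asset: it estimates a small 2D translation that maximizes the soft-IoU overlap between the rendered asset mask and the target object mask via gradient descent. The function name `refine_2d_offset`, the soft-IoU objective, and the optimizer settings are assumptions, not the paper's procedure.

```python
import torch
import torch.nn.functional as F


def refine_2d_offset(rendered_mask, target_mask, iters=200, lr=5e-3):
    """Hypothetical fine-stage sketch: after coarse 3D placement, solve
    for a small 2D translation aligning the rendered asset mask with the
    target object mask by minimizing a soft-IoU loss.
    rendered_mask, target_mask: (1, 1, H, W) float tensors in [0, 1].
    Returns the (dx, dy) offset in grid_sample's normalized [-1, 1] coords.
    """
    offset = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([offset], lr=lr)
    H, W = rendered_mask.shape[-2:]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, H, W, 2)
    for _ in range(iters):
        grid = base - offset                 # shifted sampling grid
        shifted = F.grid_sample(rendered_mask, grid, align_corners=True)
        inter = (shifted * target_mask).sum()
        union = (shifted + target_mask - shifted * target_mask).sum()
        loss = 1.0 - inter / union.clamp(min=1e-6)      # soft-IoU loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return offset.detach()
```

In a full pipeline, the recovered offset would be mapped back into the asset's placement so the refined pose provides cleaner supervision for training; that mapping depends on the renderer and is omitted here.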
Similar Papers
MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World
CV and Pattern Recognition
Makes computer pictures show real-looking reflections.
View-Consistent Diffusion Representations for 3D-Consistent Video Generation
CV and Pattern Recognition
Makes computer-made videos stay consistent in 3D.
MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation
CV and Pattern Recognition
Makes computers create driving-scene videos from many camera views.