WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
By: Xuweiyi Chen, Wentao Zhou, Zezhou Cheng
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this with an analysis-by-synthesis test: a camera-only static renderer explains the rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so that supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a benchmark of paired transient and clean captures for sparse-view, transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality, all in a single feed-forward pass.
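To make the pipeline concrete, below is a minimal PyTorch sketch of the residual-driven masking described above. It is an illustration under stated assumptions, not the authors' implementation: the L1 residual, the fixed threshold `tau`, the max-pool token masking, and names such as `pseudo_motion_mask` are hypothetical choices standing in for details the abstract does not specify.

```python
import torch
import torch.nn.functional as F

def pseudo_motion_mask(target, rendered, tau=0.1):
    """Pseudo motion mask from static-render residuals.

    Pixels the camera-only static renderer cannot explain (large
    residual) are flagged as transient. tau is an assumed threshold.
    target, rendered: (B, C, H, W); returns (B, 1, H, W) in {0, 1}.
    """
    residual = (target - rendered).abs().mean(dim=1, keepdim=True)
    return (residual > tau).float()

def gated_photometric_loss(pred, target, motion_mask):
    """Photometric loss whose gradients are gated off in transient
    regions, so supervision focuses on static background completion."""
    static_weight = 1.0 - motion_mask                      # 1 = static pixel
    per_pixel = (pred - target).abs().mean(dim=1, keepdim=True)
    return (per_pixel * static_weight).sum() / static_weight.sum().clamp(min=1.0)

def mask_input_tokens(tokens, motion_mask, patch=16):
    """Zero out transformer input tokens whose patches overlap motion.

    tokens: (B, N, C) for an (H/patch, W/patch) grid of patches.
    A patch is dropped if any pixel inside it is marked transient.
    """
    pooled = F.max_pool2d(motion_mask, kernel_size=patch)  # (B, 1, h, w)
    pooled = pooled.flatten(2).transpose(1, 2)             # (B, N, 1)
    return tokens * (1.0 - pooled)
```

At inference, the distilled motion estimator would predict `motion_mask` directly from the input frames, replacing the residual test so that the whole model runs in a single feed-forward pass, consistent with the claim above.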