SS4D: Native 4D Generative Model via Structured Spacetime Latents
By: Zhibing Li, Mengchen Zhang, Tong Wu, and more
Potential Business Impact:
Generates realistically moving 3D objects from a single video.
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) to address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency; (2) temporal consistency is enforced by introducing dedicated temporal layers that reason across frames; and (3) to support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.
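To make the factorized 4D convolution and temporal downsampling concrete, here is a minimal PyTorch sketch. It assumes a dense latent grid of shape (B, C, T, X, Y, Z) for illustration (the paper's structured latents may be sparse), and the module names `FactorizedConv4D` and `TemporalDownsample` are hypothetical, not taken from the paper's code. The key idea it demonstrates is replacing a full 4D convolution with a per-frame spatial 3D convolution followed by a per-voxel temporal 1D convolution, then shortening the temporal axis with a strided temporal convolution.

```python
import torch
import torch.nn as nn

class FactorizedConv4D(nn.Module):
    """Approximates a full 4D convolution over (T, X, Y, Z) by factorizing it
    into a spatial 3D conv applied per frame and a temporal 1D conv applied
    per spatial location. (Hypothetical sketch, not the paper's code.)"""
    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel, padding=kernel // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, X, Y, Z)
        b, c, t, xs, ys, zs = x.shape
        # Spatial pass: fold time into the batch dim, convolve over (X, Y, Z).
        h = self.spatial(x.transpose(1, 2).reshape(b * t, c, xs, ys, zs))
        h = h.reshape(b, t, c, xs, ys, zs).transpose(1, 2)
        # Temporal pass: fold space into the batch dim, convolve over T.
        h = h.permute(0, 3, 4, 5, 1, 2).reshape(b * xs * ys * zs, c, t)
        h = self.temporal(h)
        return h.reshape(b, xs, ys, zs, c, t).permute(0, 4, 5, 1, 2, 3)

class TemporalDownsample(nn.Module):
    """Halves the temporal length with a strided temporal convolution,
    compressing the latent sequence along the time axis."""
    def __init__(self, channels: int):
        super().__init__()
        self.down = nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, xs, ys, zs = x.shape
        h = x.permute(0, 3, 4, 5, 1, 2).reshape(b * xs * ys * zs, c, t)
        h = self.down(h)  # temporal length becomes ceil(T / 2)
        t2 = h.shape[-1]
        return h.reshape(b, xs, ys, zs, c, t2).permute(0, 4, 5, 1, 2, 3)

# Usage: 16 frames of an 8-channel 4x4x4 latent grid -> 8 frames after downsampling.
x = torch.randn(1, 8, 16, 4, 4, 4)
y = TemporalDownsample(8)(FactorizedConv4D(8)(x))
print(y.shape)  # torch.Size([1, 8, 8, 4, 4, 4])
```

The factorization is the standard trick for taming the parameter and compute cost of high-dimensional convolutions: two small kernels over space and time in place of one joint 4D kernel, which also lets the spatial path reuse weights from a pre-trained 3D backbone.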
Similar Papers
Inferring Compositional 4D Scenes without Ever Seeing One
CV and Pattern Recognition
Builds 3D worlds from videos, showing moving objects.
Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation
CV and Pattern Recognition
Creates realistic 3D videos from one picture.
ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
CV and Pattern Recognition
Turns videos into moving 3D models.