True Self-Supervised Novel View Synthesis is Transferable
By: Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann
Potential Business Impact:
Lets computers create new views of a scene.
In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: whether a pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: the same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme for the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers; probing experiments further show that its latent poses are highly correlated with real-world poses.
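To make the transferability criterion concrete, below is a minimal sketch of how such a check might look in code. It assumes a hypothetical model interface with encode_pose (pair-wise latent pose between two frames) and render (re-rendering a context frame under a latent pose), plus a pose_probe that reads out relative camera poses from image pairs; none of these names come from the paper, and the authors' actual metric may differ in form.

```python
# Hypothetical sketch of a transferability check for self-supervised NVS.
# All interfaces (encode_pose, render, pose_probe) are assumptions made
# for illustration; the paper's actual metric and APIs are not given here.

import torch

@torch.no_grad()
def transferability_error(model, video_a, video_b, pose_probe):
    """Extract latent poses from video A, re-render that camera
    trajectory starting from a frame of scene B, and measure how far
    the trajectory realized in B drifts from the one realized in A.

    video_a, video_b: (T, C, H, W) frame tensors from two different scenes.
    pose_probe: maps a (frame, frame) pair to an estimated relative
        camera pose vector (e.g., a learned probe or an off-the-shelf
        relative-pose estimator).
    """
    context_b = video_b[0]          # single context frame of scene B
    errors = []
    for t in range(1, video_a.shape[0]):
        # Pair-wise latent pose between consecutive frames of video A.
        z = model.encode_pose(video_a[t - 1], video_a[t])
        # Re-render the same latent pose in scene B.
        rendered_b = model.render(context_b, z)
        # Compare the relative pose realized in scene A against the one
        # realized in scene B; if latent poses transfer, these match.
        pose_a = pose_probe(video_a[t - 1], video_a[t])
        pose_b = pose_probe(context_b, rendered_b)
        errors.append(torch.linalg.norm(pose_a - pose_b))
        context_b = rendered_b      # follow the trajectory in scene B
    return torch.stack(errors).mean()
```

A non-transferable model would score poorly here even with perfect image quality, since the same latent poses would trace a different trajectory in scene B than in scene A.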
Similar Papers
Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
CV and Pattern Recognition
Creates realistic 3D scenes from photos.
DT-NVS: Diffusion Transformers for Novel View Synthesis
CV and Pattern Recognition
Creates new pictures of a scene from one photo.
AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
CV and Pattern Recognition
Creates realistic 3D objects from a single picture.