AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
By: Xuying Zhang, Yupeng Zhou, Kai Wang, and more
Potential Business Impact:
Creates realistic 3D objects from a single picture.
Novel view synthesis (NVS) is a cornerstone of image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input view, especially when the camera pose differs significantly, leading to poor-quality 3D geometry and textures. We attribute this issue to treating all target views with equal priority, based on our empirical observation that target views closer to the input view exhibit higher fidelity. Motivated by this, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input view and then uses them as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for next-view prediction, we develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves consistency between the generated views and the input view, producing high-fidelity 3D assets.
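The autoregressive near-to-far loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `pose_distance`, `generate_view`, and the azimuth-only pose representation are simplifying assumptions, and the diffusion model (with its Stacked-LE/LSTM-GE conditioning) is replaced by a stub that only records how much context each view receives.

```python
def pose_distance(a, b):
    # Toy distance between azimuth angles (degrees), wrapping at 360.
    d = abs(a - b) % 360
    return min(d, 360 - d)

def generate_view(input_view, context_views, target_pose):
    # Stub for a diffusion-model call conditioned on the input view and the
    # subsequence of already-generated views (which AR-1-to-3 encodes as
    # local and global conditions); here we just record pose and context size.
    return {"pose": target_pose, "context_size": len(context_views)}

def ar_1_to_3(input_view, input_pose, target_poses):
    # Sort targets near-to-far from the input pose, then synthesize in order,
    # feeding every previously generated view back in as context.
    ordered = sorted(target_poses, key=lambda p: pose_distance(p, input_pose))
    generated = []
    for pose in ordered:
        generated.append(generate_view(input_view, generated, pose))
    return generated

views = ar_1_to_3("input.png", 0, [180, 30, 90])
print([v["pose"] for v in views])          # near-to-far order: [30, 90, 180]
print([v["context_size"] for v in views])  # growing context: [0, 1, 2]
```

The key point the sketch captures is the ordering: farther views, which are hardest to keep consistent, are generated last and therefore condition on the most context.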
Similar Papers
The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge
CV and Pattern Recognition
Creates new pictures of things from few photos.
Geometric Consistency Refinement for Single Image Novel View Synthesis via Test-Time Adaptation of Diffusion Models
CV and Pattern Recognition
Makes 3D pictures look real from any angle.
Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers
CV and Pattern Recognition
Creates 3D pictures from different viewpoints.