Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis
By: Saar Stern, Ido Sobol, Or Litany
Potential Business Impact:
Checks if computer-made pictures look real.
The goal of Novel View Synthesis (NVS) is to generate realistic images of given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating the reliability of such an image remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results because they fail to capture the nuanced relationship between the source image, viewpoint change, and generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, $D_{\text{PRISM}}$, and a reference-free score, $\text{MMD}_{\text{PRISM}}$. Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis. To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D, where $\text{MMD}_{\text{PRISM}}$ produces a clear and stable ranking, with lower scores consistently indicating stronger models.
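The name $\text{MMD}_{\text{PRISM}}$ suggests a maximum mean discrepancy computed over feature vectors. The abstract does not state the kernel, the estimator, or how the features are extracted, so the following is only a minimal sketch of a generic kernel MMD under stated assumptions: an RBF kernel, the biased estimator, and random arrays standing in for the tuned Zero123 features. The function names and the bandwidth value are hypothetical and are not taken from the paper.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    sq = np.maximum(sq, 0.0)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between feature sets x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Stand-in features (assumption): rows are per-image feature vectors.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 8))           # features of trusted views
close_model = rng.normal(size=(200, 8))         # same distribution as reference
far_model = rng.normal(loc=1.5, size=(200, 8))  # shifted distribution

score_close = mmd2_biased(reference, close_model, bandwidth=3.0)
score_far = mmd2_biased(reference, far_model, bandwidth=3.0)
print(f"close: {score_close:.4f}  far: {score_far:.4f}")
```

In this toy setup the feature set drawn from the shifted distribution receives the larger score, which matches the convention in the abstract that lower scores indicate stronger models.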
Similar Papers
Sphinx: Efficiently Serving Novel View Synthesis using Regression-Guided Selective Refinement
CV and Pattern Recognition
Makes 3D scenes look real, super fast.
DT-NVS: Diffusion Transformers for Novel View Synthesis
CV and Pattern Recognition
Creates new pictures of a scene from one photo.
AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
CV and Pattern Recognition
Creates realistic 3D objects from a single picture.