Omnidirectional Spatial Modeling from Correlated Panoramas
By: Xinshen Zhang, Tongxi Fu, Xu Zheng
Potential Business Impact:
Helps robots understand 360° views better.
Omnidirectional scene understanding is vital for downstream applications such as embodied AI, autonomous driving, and immersive environments, yet it remains challenging because of geometric distortion and complex spatial relations in 360° imagery. Existing omnidirectional methods perform scene understanding within a single frame and neglect correlations across panoramas captured in different frames. To bridge this gap, we introduce CFpano, the first benchmark dataset dedicated to visual question answering (VQA) over cross-frame correlated panoramas in holistic 360° scenes. CFpano consists of over 2700 images and over 8000 question-answer pairs, covering both multiple-choice and open-ended VQA. Building on CFpano, we further present our model, a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning over cross-frame correlated panoramas. We also benchmark existing MLLMs on CFpano. The experimental results show that our model achieves state-of-the-art performance on both multiple-choice and open-ended VQA tasks, outperforming strong baselines on all major reasoning categories (+5.37% in overall performance). Our analyses validate the effectiveness of GRPO and establish a new benchmark for panoramic scene understanding.
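The abstract does not describe the reward functions or the training setup, so the following is only a minimal sketch of the group-relative advantage step that gives GRPO its name. Several answers are sampled for the same question, each is scored, and each score is normalized against its own group. The exact-match reward `choice_reward`, the helper `group_relative_advantages`, the use of the population standard deviation, and the sample answers are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def choice_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the multiple-choice letter matches, else 0.0."""
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group.

    Assumption: population standard deviation, with eps added so that a group
    of identical rewards yields zero advantages instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question whose correct option is "B".
answers = ["B", "A", "b ", "C"]
rewards = [choice_reward(a, "B") for a in answers]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct answers receive positive advantages and incorrect ones negative advantages, relative to the group. If every sampled answer gets the same reward, all advantages are zero and that group contributes no learning signal.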
Similar Papers
Dense360: Dense Understanding from Omnidirectional Panoramas
CV and Pattern Recognition
Lets computers see and understand everything around.
Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method
CV and Pattern Recognition
Helps computers understand 360° pictures better.
JoPano: Unified Panorama Generation via Joint Modeling
CV and Pattern Recognition
Makes 360-degree pictures from words or other pictures.