Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method
By: Xinshen Zhang, Zhen Ye, Xu Zheng
Potential Business Impact:
Helps computers understand 360° pictures better.
Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications like augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset for omnidirectional visual question answering, and by conducting the first benchmark on this task. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, which calls for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct. Concretely, we modify group relative policy optimization (GRPO) by proposing three novel reward functions: (1) a reasoning process similarity reward, (2) an answer semantic accuracy reward, and (3) a structured format compliance reward. Extensive experiments on OmniVQA demonstrate the superiority of our proposed method in omnidirectional space (+6% improvement).
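The three-reward GRPO setup described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the `<think>`/`<answer>` output template, the token-overlap (Jaccard) similarity used as a stand-in for both the reasoning similarity and the answer accuracy rewards, the equal weights, and every function name below are assumptions made for illustration. Only the group-relative normalization (each sampled response's reward minus the group mean, divided by the group standard deviation) follows the standard GRPO formulation.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template; the abstract does not specify the tag names.
FORMAT_RE = re.compile(r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL)


def _tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token-set overlap in [0, 1]; a crude stand-in for a semantic similarity score."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def total_reward(response, ref_reasoning, ref_answer, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards (weights are assumed, not from the paper)."""
    match = FORMAT_RE.match(response.strip())
    if match is None:
        # Format reward is 0, and without the tags the other parts cannot be extracted.
        return 0.0
    reasoning, answer = match.group(1), match.group(2)
    r_reason = jaccard(reasoning, ref_reasoning)  # (1) reasoning process similarity
    r_answer = jaccard(answer, ref_answer)        # (2) answer accuracy (proxy)
    r_format = 1.0                                # (3) structured format compliance
    w_reason, w_answer, w_format = weights
    return w_reason * r_reason + w_answer * r_answer + w_format * r_format


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO normalization over one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<think>the lamp is left of the sofa</think><answer>left</answer>",
        "<think>the lamp is near the window</think><answer>right</answer>",
        "left",  # violates the assumed format
    ]
    rewards = [total_reward(g, "the lamp is left of the sofa", "left") for g in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a response that breaks the template receives zero total reward, so within a group it is pushed below responses that at least follow the structure. A real system would likely replace the Jaccard proxy with a learned or embedding-based similarity measure.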
Similar Papers
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
CV and Pattern Recognition
Tests if AI can understand 360-degree views.
ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
CV and Pattern Recognition
Helps computers understand 360-degree pictures better.
Dense360: Dense Understanding from Omnidirectional Panoramas
CV and Pattern Recognition
Lets computers see and understand everything around.