Dense360: Dense Understanding from Omnidirectional Panoramas
By: Yikang Zhou, Tao Zhang, Dizhe Zhang, and more
Potential Business Impact:
Lets computers see and understand everything around them.
Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce a dataset of omnidirectional panoramas featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: (i) spatial continuity along the circle of latitude, and (ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
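The abstract does not spell out how ERP-RoPE handles the wrap-around seam, so the following is only a minimal sketch of the general idea of a periodic rotary position encoding along the longitude axis, assuming integer harmonics of 2π/W so that the left and right edges of the ERP image (which meet on the sphere) receive identical encodings. All names and parameters here are hypothetical illustrations, not the paper's actual formulation.

```python
# Hypothetical sketch of a wrap-around rotary position encoding along the
# longitude axis of an equirectangular projection (ERP). Not the paper's
# ERP-RoPE; it only illustrates the circular-continuity idea from the abstract.
import torch

def circular_rope_angles(num_cols: int, dim: int) -> torch.Tensor:
    """Angles for a periodic RoPE over `num_cols` longitude positions.

    Using integer harmonics of 2*pi/num_cols makes the encoding periodic:
    the rotation at column 0 and at column num_cols coincide modulo 2*pi.
    """
    assert dim % 2 == 0, "RoPE rotates feature pairs, so dim must be even"
    cols = torch.arange(num_cols, dtype=torch.float32)              # (W,)
    harmonics = torch.arange(1, dim // 2 + 1, dtype=torch.float32)  # (dim/2,)
    # angle[x, k] = 2*pi * x * k / W  -> periodic in x with period W
    return 2 * torch.pi * cols[:, None] * harmonics[None, :] / num_cols

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (shape: W, dim) by the given angles."""
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Tokens at column 0 and column W would get the same rotation, reflecting the
# spatial continuity of the panorama along each circle of latitude.
W, D = 64, 8
tokens = torch.randn(W, D)
encoded = apply_rope(tokens, circular_rope_angles(W, D))
```

Under these assumptions, relative rotations between two columns depend only on their angular separation on the circle of latitude, which is one plausible way to respect ERP's horizontal wrap-around; how ERP-RoPE additionally handles latitude-dependent information density is not described in the abstract.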
Similar Papers
Omnidirectional Spatial Modeling from Correlated Panoramas
CV and Pattern Recognition
Helps robots understand 360° views better.
Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method
CV and Pattern Recognition
Helps computers understand 360° pictures better.
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
CV and Pattern Recognition
Tests if AI can understand 360-degree views.