MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
By: Chonghan Liu, Haoran Wang, Felix Henry and more
Potential Business Impact:
Helps computers understand how things are placed.
Spatial perception and reasoning are core components of human cognition, encompassing object recognition, spatial relational understanding, and dynamic reasoning. Despite progress in computer vision, existing benchmarks reveal significant gaps in models' abilities to accurately recognize object attributes and reason about spatial relationships, both essential for dynamic reasoning. To address these limitations, we propose MIRAGE, a multi-modal benchmark designed to evaluate models' capabilities in Counting (object attribute recognition), Relation (spatial relational reasoning), and Counting with Relation. Through diverse and complex scenarios requiring fine-grained recognition and reasoning, MIRAGE highlights critical limitations in state-of-the-art models, underscoring the need for improved representations and reasoning frameworks. By targeting these foundational abilities, MIRAGE provides a pathway toward spatiotemporal reasoning in future research.
Similar Papers
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
CV and Pattern Recognition
Helps computers understand spaces like humans do.
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
CV and Pattern Recognition
Computers still struggle to understand space.