GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
By: Sahiti Yerramilli, Nilay Pande, Rynaa Grover, and others
Potential Business Impact:
Teaches computers to understand maps and locations.
This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence, yielding over 30 million Q&A pairs. These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories (visual, spatial, cultural, and precise geolocation), with each step annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Benchmarking contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to localize accurately, especially as reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advances in complex geographic reasoning within MLLMs.
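To make the benchmark's structure concrete, here is a minimal Python sketch of how a single GeoChain sample and a per-step accuracy metric might be represented. All class names, field names, and the exact-match scoring rule below are assumptions for illustration; they are not the paper's actual data schema or evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical representation of one GeoChain sample.
# Field names and types are illustrative assumptions, not the released schema.

class Category(Enum):
    VISUAL = "visual"
    SPATIAL = "spatial"
    CULTURAL = "cultural"
    GEOLOCATION = "precise_geolocation"

@dataclass
class Step:
    index: int          # position in the 21-step chain (1..21)
    question: str       # natural-language question for this step
    answer: str         # ground-truth answer
    category: Category  # one of the four reasoning categories
    difficulty: str     # difficulty annotation, e.g. "easy" / "medium" / "hard"

@dataclass
class GeoChainSample:
    image_id: str                    # Mapillary street-level image identifier
    steps: list[Step]                # the 21-step chain-of-thought Q&A sequence
    segmentation_classes: set[str]   # subset of the 150 semantic classes present
    locatability: float              # visual locatability score, assumed in [0, 1]
    latlon: tuple[float, float]      # ground-truth coordinates

def stepwise_accuracy(sample: GeoChainSample, predictions: list[str]) -> float:
    """Fraction of the chain's steps answered correctly (case-insensitive exact match)."""
    correct = sum(
        pred.strip().lower() == step.answer.strip().lower()
        for step, pred in zip(sample.steps, predictions)
    )
    return correct / len(sample.steps)
```

A representation like this makes the paper's diagnostic angle easy to see: because every answer is tied to a step index, category, and difficulty, accuracy can be broken down along the chain to show exactly where a model's reasoning degrades as questions move from coarse attributes to precise geolocation.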
Similar Papers
Grounded Chain-of-Thought for Multimodal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures without making things up.
GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes
CV and Pattern Recognition
Teaches computers to understand maps without human help.
Do Large Language Models Truly Understand Geometric Structures?
Computation and Language
Teaches computers to understand shapes and space.