
GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Published: November 17, 2025 | arXiv ID: 2511.13259v1

By: Yushuo Zheng, Jiangyong Ying, Huiyu Duan and more

Potential Business Impact:

Helps robots and cars know where they are.

Business Areas:

Geospatial Data and Analytics, Navigation and Mapping

Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in cross-view geo-localization and pose estimation remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the gains from instruction tuning. Our benchmark results demonstrate that while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.

MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

CV and Pattern Recognition

Maps understand pictures, words, and places together.

19 Dec 2025 1

89%

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

CV and Pattern Recognition

Tests if AI understands pictures, sound, and words equally.

16 Oct 2025 0

89%

From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

CV and Pattern Recognition

Tests AI on guessing photo locations.

3 Aug 2025 2


Repos / Data Links

github.com

Page Count

9 pages
