Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
By: Yuxiang Ji, Yong Wang, Ziyu Ma and more
Potential Business Impact:
Helps computers find where pictures were taken.
Image geolocalization aims to predict where on Earth an image was taken, using only visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a strategy humans commonly use: consulting maps. In this work, we first equip the model with a "Thinking with Map" ability and formulate it as an agent-in-the-map loop. We then develop a two-stage optimization scheme for it: agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL stage strengthens the model's agentic capability to improve sampling efficiency, and the parallel TTS stage lets the model explore multiple candidate paths before making its final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date, in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics; in particular, it raises Acc@500m from 8.0% (Gemini-3-Pro in Google Search/Map grounded mode) to 22.1%.
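The abstract reports results as Acc@500m. A metric of this kind is commonly read as the fraction of predictions that fall within 500 m of the ground-truth coordinates. The sketch below illustrates that reading under stated assumptions: great-circle (haversine) distance on a spherical Earth, coordinates given as (latitude, longitude) pairs in degrees, and hypothetical function names. It is not the paper's evaluation code, and the paper may compute distance differently.

```python
import math

# Mean Earth radius in metres; a spherical approximation (assumption).
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at(preds, truths, threshold_m=500.0):
    """Fraction of predictions within threshold_m metres of the ground truth.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    Hypothetical helper, not taken from the paper.
    """
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(
        haversine_m(p[0], p[1], t[0], t[1]) <= threshold_m
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)
```

For example, calling accuracy_at(preds, truths, threshold_m=500.0) on a list of predicted and true coordinates yields a value between 0 and 1; multiplying by 100 gives a percentage comparable in form to the figures quoted above.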