Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation

Published: September 1, 2025 | arXiv ID: 2509.01341v1

By: Yunus Serhat Bicakci, Joseph Shingleton, Anahid Basiri

Potential Business Impact:

Find exact locations from photos.

Business Areas:
Visual Search, Internet Services

Street-level geolocalization from images is crucial for a wide range of applications and services, such as navigation, location-based recommendations, and urban planning. With the growing popularity of social media data and cameras embedded in smartphones, applying traditional computer vision techniques to localize images has become increasingly challenging, yet highly valuable. This paper introduces a novel approach that integrates open-weight and publicly accessible multimodal large language models with retrieval-augmented generation. The method constructs a vector database using the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M). Before a query image is passed to the multimodal large language model, its prompt is augmented with geolocation information from both similar and dissimilar images retrieved from this database. Our approach achieves state-of-the-art performance, with higher accuracy than prior methods on three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k). Importantly, our solution eliminates the need for expensive fine-tuning or retraining and scales seamlessly to incorporate new data sources. The effectiveness of retrieval-augmented multimodal large language models in geolocation estimation, as demonstrated in this paper, suggests an alternative to traditional methods that rely on training models from scratch, opening new possibilities for more accessible and scalable solutions in GeoAI.
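
The retrieval and prompt-augmentation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy three-dimensional vectors stand in for SigLIP embeddings, the in-memory list stands in for the vector database, and the function names (`cosine`, `retrieve`, `build_prompt`), the prompt wording, and the example coordinates are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, database, k_similar=2, k_dissimilar=1):
    """Return the most similar and the least similar database entries."""
    ranked = sorted(
        database,
        key=lambda entry: cosine(query_vec, entry["embedding"]),
        reverse=True,
    )
    similar = ranked[:k_similar]
    # Take dissimilar entries from the remainder so the two sets never overlap.
    remainder = ranked[k_similar:]
    dissimilar = list(reversed(remainder))[:k_dissimilar]
    return similar, dissimilar


def build_prompt(similar, dissimilar):
    """Assemble a text prompt (hypothetical wording) to accompany the image."""
    lines = ["Estimate the latitude and longitude of the attached image."]
    lines.append("Visually similar reference images were taken at:")
    lines += [f"- ({e['lat']:.4f}, {e['lon']:.4f})" for e in similar]
    lines.append("Visually dissimilar reference images were taken at:")
    lines += [f"- ({e['lat']:.4f}, {e['lon']:.4f})" for e in dissimilar]
    return "\n".join(lines)


# Toy database: placeholder embeddings, not real SigLIP outputs.
database = [
    {"embedding": [0.9, 0.1, 0.0], "lat": 48.8566, "lon": 2.3522},
    {"embedding": [0.8, 0.2, 0.1], "lat": 45.7640, "lon": 4.8357},
    {"embedding": [0.0, 1.0, 0.0], "lat": 40.7128, "lon": -74.0060},
    {"embedding": [-1.0, 0.0, 0.0], "lat": -33.8688, "lon": 151.2093},
]

query_embedding = [1.0, 0.0, 0.0]  # stand-in for the encoded query image
similar, dissimilar = retrieve(query_embedding, database)
prompt = build_prompt(similar, dissimilar)
print(prompt)
```

In a real pipeline the prompt would be sent to the multimodal model together with the query image. Because the model itself is not retrained, adding a new data source would only require encoding its images and appending them to the database.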

Country of Origin
🇹🇷 Turkey

Page Count
13 pages

Category
Computer Science:
CV and Pattern Recognition