SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
By: Aysim Toker, Andreea-Maria Oncescu, Roy Miles, and more
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach finetunes a pretrained VLM on a diverse set of instruction-following tasks while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving on the state of the art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
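To make the control-token mechanism concrete, below is a minimal sketch of one common way such an interface can be wired up; it is not the paper's actual implementation. The token name `<loc>`, the `GroundingHead` module, and the hidden dimension are all assumptions for illustration: the VLM emits a special control token during generation, and the hidden state at that position is routed to a small head that regresses a normalized bounding box.

```python
# Hypothetical sketch of control-token-based grounding (names are illustrative,
# not the paper's API). The VLM emits a special <loc> token; the hidden states
# at those positions are passed to a grounding head that regresses box coordinates.
import torch
import torch.nn as nn

LOC_TOKEN_ID = 32000  # assumed id of the added <loc> control token


class GroundingHead(nn.Module):
    """Maps the VLM hidden state at each <loc> token to a (cx, cy, w, h) box."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 4),
        )

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Select hidden states at the positions where <loc> was emitted.
        loc_mask = token_ids == LOC_TOKEN_ID        # (batch, seq)
        loc_states = hidden_states[loc_mask]        # (num_loc, hidden_dim)
        # Sigmoid keeps boxes normalized to [0, 1] in image coordinates.
        return torch.sigmoid(self.mlp(loc_states))


# Toy usage with random tensors standing in for real VLM outputs.
hidden = torch.randn(1, 16, 4096)
ids = torch.full((1, 16), 7)
ids[0, 5] = LOC_TOKEN_ID                            # model "emitted" one <loc>
boxes = GroundingHead()(hidden, ids)
print(boxes.shape)                                  # torch.Size([1, 4])
```

A design like this keeps language decoding and spatial regression in separate heads, so the box head can be trained jointly with instruction finetuning without forcing coordinates into the text vocabulary.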
Similar Papers
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
CV and Pattern Recognition
Finds tiny things in big satellite pictures.
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
CV and Pattern Recognition
Teaches computers to understand 3D space from pictures.