Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval

Published: January 26, 2026 | arXiv ID: 2601.18190v1

By: Yifan Li, Shiying Wang, Jianqiang Huang

Potential Business Impact:

Finds specific places on maps using words.

Business Areas:

Image Recognition, Data and Analytics, Software

Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G^2A) adapter, which captures global context and long-range dependencies with minimal overhead. Furthermore, a Multi-Perspective Representation (MPR) module aggregates these local cues into robust multi-perspective embeddings. The framework is optimized via a hybrid objective combining multi-perspective contrastive and weighted triplet losses, which dynamically selects maximum-response perspectives to suppress noise and enforce precise semantic matching. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall (mR), respectively, significantly outperforming full fine-tuning baselines and recent competitive methods. Code is available at https://github.com/Lcrucial1f/MPS-CLIP.
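The "maximum-response perspective" idea in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the contrastive term is a standard symmetric InfoNCE loss, and the triplet term is a plain hardest-negative hinge loss standing in for the paper's weighted variant, whose weighting scheme is not described here. The temperature and margin values are placeholders, not the authors' settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_response_similarity(image_perspectives, text_embedding):
    """Score an image-text pair by its best-matching perspective embedding."""
    return max(cosine(p, text_embedding) for p in image_perspectives)


def similarity_matrix(images, texts):
    """sim[i][j] = max-response similarity between image i and text j."""
    return [[max_response_similarity(img, t) for t in texts] for img in images]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss; matched pairs sit on the diagonal of sim."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))


def triplet_loss(sim, margin=0.2):
    """Hardest-negative hinge loss in both retrieval directions (unweighted)."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total / n


# Toy batch: two images with two perspective embeddings each, two captions.
images = [[[1, 0], [0, 1]], [[-1, 0], [0, -1]]]
texts = [[0, 1], [-1, 0]]
sim = similarity_matrix(images, texts)
hybrid = contrastive_loss(sim) + triplet_loss(sim)
```

Taking the maximum over perspectives means that only the sub-view most relevant to a caption contributes to the pair's score, which is one plausible reading of how the objective "suppresses noise" from irrelevant sub-images.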

Related Papers:

LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
CV and Pattern Recognition | 25 Mar 2025
Helps computers understand satellite pictures better.

FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
CV and Pattern Recognition | 18 Nov 2025
Helps computers understand satellite pictures better.

MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP
CV and Pattern Recognition | 13 Jan 2026
Lets computers understand Earth pictures using words.

Page Count

7 pages
