Recurrent Cross-View Object Geo-Localization
By: Xiaohan Zhang, Si-Yuan Cao, Xiaokai Bai and more
Potential Business Impact:
Locates a specific object in satellite imagery, given a query image with a point marking the object.
Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
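The recurrent idea in the abstract (tokens that repeatedly attend to reference features and refine a location estimate) can be illustrated with a toy sketch. This is not the ReCOT implementation: the single-head attention, the residual token update, the soft-argmax location head, and all names (`cross_attend`, `recurrent_localize`, the 4x4 grid, the `target` cell) are assumptions chosen only to make the loop concrete in plain Python.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(token, feats):
    """Single-head scaled dot-product attention of one token over all reference features."""
    scale = math.sqrt(len(token))
    weights = softmax([dot(token, f) / scale for f in feats])
    attended = [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(len(token))]
    return attended, weights

def recurrent_localize(tokens, feats, coords, steps=3):
    """Refine tokens for `steps` rounds and return one (x, y) estimate per round."""
    trajectory = []
    for _ in range(steps):
        new_tokens, all_weights = [], []
        for t in tokens:
            attended, w = cross_attend(t, feats)
            # Residual update: the token absorbs what it attended to (assumed update rule).
            new_tokens.append([a + b for a, b in zip(t, attended)])
            all_weights.append(w)
        tokens = new_tokens
        # Hypothetical location head: soft-argmax over the token-averaged attention map.
        mean_w = [sum(w[i] for w in all_weights) / len(all_weights) for i in range(len(feats))]
        x = sum(w * c[0] for w, c in zip(mean_w, coords))
        y = sum(w * c[1] for w, c in zip(mean_w, coords))
        trajectory.append((x, y))
    return trajectory

# Toy reference "image": a 4x4 grid of 4-d features, all zero except the target cell.
coords = [(x, y) for y in range(4) for x in range(4)]
target = (2, 1)
feats = [[4.0, 0.0, 0.0, 0.0] if c == target else [0.0] * 4 for c in coords]

# One stand-in for a learnable token, weakly aligned with the target feature.
tokens = [[1.0, 0.0, 0.0, 0.0]]

trajectory = recurrent_localize(tokens, feats, coords, steps=3)
for step, (x, y) in enumerate(trajectory, start=1):
    print(f"step {step}: estimate = ({x:.3f}, {y:.3f})")
```

In this toy setup each round sharpens the attention on the target cell, so the estimate moves closer to `target` at every step. That mirrors the error-correction motivation stated above: a one-shot prediction would stop at the first, coarse estimate.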
Similar Papers
Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
CV and Pattern Recognition
Helps find objects using pictures from different angles.
SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts
CV and Pattern Recognition
Finds objects in satellite photos from drone pictures.
Referring Video Object Segmentation with Cross-Modality Proxy Queries
CV and Pattern Recognition
Helps computers find specific things in videos.