RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation
By: Fu Rong, Meng Lan, Qian Zhang and more
Potential Business Impact:
Helps computers find things in satellite pictures using words.
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text descriptions. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features, providing pseudo-mask-based dense prompts, and enforcing boundary constraints. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
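To make the idea of a pseudo-mask dense prompt concrete, here is a minimal sketch in plain Python. It is not the paper's mask prompt generator, whose internals the abstract does not specify. It assumes, purely for illustration, that a pseudo-mask can be approximated by a sigmoid over the scaled dot product between each pixel's visual embedding and a single multimodal class token. The function name `pseudo_mask` and the toy values are hypothetical.

```python
import math
from typing import List


def pseudo_mask(
    visual: List[List[List[float]]], class_token: List[float]
) -> List[List[float]]:
    """Hypothetical dense prompt: per-pixel sigmoid(similarity to class token).

    visual is an H x W grid of C-dimensional embeddings; class_token has C values.
    This is an illustrative assumption, not the method described in the paper.
    """
    # Scale by 1/sqrt(C) so scores do not grow with the embedding size.
    scale = 1.0 / math.sqrt(len(class_token))
    mask = []
    for row in visual:
        mask_row = []
        for pixel in row:
            score = sum(p * c for p, c in zip(pixel, class_token)) * scale
            mask_row.append(1.0 / (1.0 + math.exp(-score)))
        mask.append(mask_row)
    return mask


# Toy 2 x 2 feature map with 2-dimensional embeddings (made-up values).
visual = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[-4.0, 0.0], [4.0, 4.0]],
]
class_token = [1.0, 1.0]

mask = pseudo_mask(visual, class_token)
# Threshold the soft mask to see which pixels the token "points at".
binary = [[1 if v > 0.5 else 0 for v in row] for row in mask]
```

In this toy setup, pixels whose embeddings align with the class token receive values near 1 and the rest stay at or below 0.5. A soft map of this kind is what a dense (mask) prompt would pass to the segmentation decoder in place of points or boxes.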
Similar Papers
Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation
CV and Pattern Recognition
Finds objects in pictures from words.
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
CV and Pattern Recognition
Teaches computers to perfectly outline things in satellite pictures.
Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation
CV and Pattern Recognition
Lets computers find things in pictures using words.