Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation
By: Leideng Shi, Juan Zhang
Potential Business Impact:
Lets computers find things in satellite pictures using words.
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation that aims to segment objects based on a given text description, and it holds great significance for practical applications. Previous studies fuse the visual and linguistic modalities through explicit feature interaction, which fails to effectively excavate useful multimodal information from the dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) that enhances multi-scale visual features by adaptively introducing noise into the transformer and integrates cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales, boosting their representation ability and enhancing segmentation accuracy. Extensive experiments show that MAFN significantly outperforms the state of the art on the RRSIS-D dataset. The source code is available at https://github.com/Roaxy/MAFN.
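The abstract names two components: a correlation fusion module (CFM) that perturbs visual tokens with adaptive noise before cross-modal fusion, and a multi-scale refinement convolution (MSRC) that handles objects of varying scales and orientations. Below is a minimal PyTorch sketch of one plausible reading of those ideas; all shapes, the noise mechanism, and the kernel choices are assumptions for illustration, not the authors' implementation (see https://github.com/Roaxy/MAFN for the real model).

```python
# Hypothetical sketch of CFM- and MSRC-style modules; details are assumed.
import torch
import torch.nn as nn


class CorrelationFusionModule(nn.Module):
    """Fuses visual tokens with language features via cross-attention,
    first perturbing the visual tokens with a learned, adaptive noise
    scale (one hypothetical reading of 'adaptively introducing noise')."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.noise_scale = nn.Parameter(torch.zeros(1, 1, dim))  # learned per-channel scale
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, N_v, C) visual tokens; lang: (B, N_l, C) text tokens
        noisy_vis = vis + torch.randn_like(vis) * self.noise_scale
        fused, _ = self.cross_attn(query=noisy_vis, key=lang, value=lang)
        return self.norm(vis + fused)  # residual cross-modal fusion


class MultiScaleRefinementConv(nn.Module):
    """Parallel convolutions with differently shaped kernels to cover
    objects at various scales/orientations (kernel set is an assumption)."""

    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),               # local context
            nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),     # horizontal strip
            nn.Conv2d(channels, channels, (5, 1), padding=(2, 0)),     # vertical strip
        ])
        self.proj = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))


if __name__ == "__main__":
    B, C, H, W, N_l = 2, 64, 32, 32, 12
    vis_map = torch.randn(B, C, H, W)   # one visual feature scale
    lang = torch.randn(B, N_l, C)       # text encoder output

    cfm = CorrelationFusionModule(dim=C)
    msrc = MultiScaleRefinementConv(C)

    tokens = vis_map.flatten(2).transpose(1, 2)           # (B, H*W, C)
    fused = cfm(tokens, lang)                             # cross-modal fusion
    fused_map = fused.transpose(1, 2).reshape(B, C, H, W)
    out = msrc(fused_map)                                 # multi-scale refinement
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In the paper the modules operate over multiple encoder scales; the sketch shows a single scale for brevity.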
Similar Papers
A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark
CV and Pattern Recognition
Helps computers find objects in satellite pictures.
RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation
CV and Pattern Recognition
Helps computers find things in satellite pictures using words.
Referring Remote Sensing Image Segmentation with Cross-view Semantics Interaction Network
CV and Pattern Recognition
Helps computers find things in satellite pictures from descriptions.