Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
By: Jianming Liu, Wenlong Qiu, Haitao Wei
Potential Business Impact:
Teaches computers to recognize and outline new objects from just a few examples, without needing the original training data.
Plain English Summary
Imagine you want a computer to identify and outline specific objects in photos, like a particular type of bird, but you only have a few examples. This new method helps the computer do that even when the new photos look very different from the examples it was trained on, and without needing access to the original training photos. This means AI can be more accurate at recognizing new things in different situations, even when privacy or data limits are a concern.
Few-Shot Segmentation (FSS) aims to efficiently segment new objects given only a few labeled samples. However, its performance degrades significantly when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) has been proposed to mitigate such performance degradation. Current CD-FSS methods have primarily sought to develop segmentation models on a source domain that are capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the need to minimize data transfer and training costs, developing source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target-domain task adaptation without requiring source-domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt the multi-level features extracted from the shared pre-trained backbone to the target task. The parameters of the TSAA are then trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module exploits global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and fusing them via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.
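The abstract does not give implementation details, but the adapter-on-frozen-pyramid idea can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the authors' code: the module names (TaskSpecificAttentionAdapter, AdaptedPyramid, text_visual_alignment_loss), the channel sizes, the squeeze-and-excitation-style attention, and the contrastive form of the text-visual objective are not taken from the paper. The sketch only shows the general pattern of training lightweight adapters on top of a frozen backbone and aligning pooled visual embeddings with pre-aligned text embeddings (e.g., CLIP prompts).

```python
# Minimal sketch, assuming a frozen multi-level backbone: lightweight attention
# adapters are appended to each pyramid level, and only adapter parameters are
# trained for target-task adaptation. Names, shapes, and losses are hypothetical.
import torch
import torch.nn as nn


class TaskSpecificAttentionAdapter(nn.Module):
    """Hypothetical TSAA block: channel attention plus a residual 1x1 projection,
    applied to one pyramid level of a frozen backbone."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Re-weight channels for the target task; the residual path keeps the
        # frozen backbone features intact when the adapter is untrained.
        return feat + self.proj(feat * self.attn(feat))


class AdaptedPyramid(nn.Module):
    """Attach one adapter per pyramid level; the backbone itself stays frozen."""

    def __init__(self, level_channels=(256, 512, 1024, 2048)):
        super().__init__()
        self.adapters = nn.ModuleList(
            TaskSpecificAttentionAdapter(c) for c in level_channels
        )

    def forward(self, pyramid_feats):
        return [adapter(f) for adapter, f in zip(self.adapters, pyramid_feats)]


def text_visual_alignment_loss(visual_emb, text_emb, temperature=0.07):
    """Hypothetical TVEA-style objective: pull pooled visual embeddings toward
    matching pre-aligned text embeddings via a contrastive cosine-similarity loss."""
    v = nn.functional.normalize(visual_emb, dim=-1)
    t = nn.functional.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return nn.functional.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Dummy multi-level features standing in for a frozen backbone's outputs.
    feats = [torch.randn(1, c, s, s)
             for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
    adapted = AdaptedPyramid()(feats)
    print([f.shape for f in adapted])
```

In a source-free setting along these lines, only the adapter parameters (and any lightweight prediction head) would be optimized on the few labeled target samples using VVEA/TVEA-style objectives, while the backbone and the text encoder remain frozen.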
Similar Papers
DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model
CV and Pattern Recognition
Teaches computers to recognize new things with few examples.
Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation
CV and Pattern Recognition
Helps computers learn new things with few examples.
Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining
CV and Pattern Recognition
Lets computers learn new things with less data.