Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
By: Jianming Liu, Wenlong Qiu, Haitao Wei
Potential Business Impact:
Teaches computers to recognize and outline new objects from just a few examples, without needing the original training data.
Plain English Summary
Imagine you want a computer to identify and outline specific objects in photos, like a particular type of bird, but you only have a few examples. This new method helps the computer do that even when the new photos look very different from the examples it was trained on, and without needing access to the original training photos. This means AI can be more accurate at recognizing new things in different situations, even when privacy or data limits are a concern.
Few-Shot Segmentation (FSS) aims to efficiently segment new objects given only a few labeled samples. However, its performance degrades significantly when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) has been proposed to mitigate such performance degradation. Current CD-FSS methods have primarily sought to develop segmentation models on a source domain that are capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the need to minimize data transfer and training costs, developing source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target-domain task adaptation without requiring source-domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt the multi-level features extracted from the shared pre-trained backbone to the target task. The parameters of the TSAA are then trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module exploits global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and fusing them via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.
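The abstract does not give implementation details, but the adapter-on-frozen-pyramid idea can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the authors' code: the module names (TaskSpecificAttentionAdapter, AdaptedPyramid, text_visual_alignment_loss), the channel sizes, the squeeze-and-excitation-style attention, and the contrastive form of the text-visual objective are not taken from the paper. The sketch only shows the general pattern of training lightweight adapters on top of a frozen backbone and aligning pooled visual embeddings with pre-aligned text embeddings (e.g., CLIP prompts).

```python
# Minimal sketch, assuming a frozen multi-level backbone: lightweight attention
# adapters are appended to each pyramid level, and only adapter parameters are
# trained for target-task adaptation. Names, shapes, and losses are hypothetical.
import torch
import torch.nn as nn


class TaskSpecificAttentionAdapter(nn.Module):
    """Hypothetical TSAA block: channel attention plus a residual 1x1 projection,
    applied to one pyramid level of a frozen backbone."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Re-weight channels for the target task; the residual path keeps the
        # frozen backbone features intact when the adapter is untrained.
        return feat + self.proj(feat * self.attn(feat))


class AdaptedPyramid(nn.Module):
    """Attach one adapter per pyramid level; the backbone itself stays frozen."""

    def __init__(self, level_channels=(256, 512, 1024, 2048)):
        super().__init__()
        self.adapters = nn.ModuleList(
            TaskSpecificAttentionAdapter(c) for c in level_channels
        )

    def forward(self, pyramid_feats):
        return [adapter(f) for adapter, f in zip(self.adapters, pyramid_feats)]


def text_visual_alignment_loss(visual_emb, text_emb, temperature=0.07):
    """Hypothetical TVEA-style objective: pull pooled visual embeddings toward
    matching pre-aligned text embeddings via a contrastive cosine-similarity loss."""
    v = nn.functional.normalize(visual_emb, dim=-1)
    t = nn.functional.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return nn.functional.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Dummy multi-level features standing in for a frozen backbone's outputs.
    feats = [torch.randn(1, c, s, s)
             for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
    adapted = AdaptedPyramid()(feats)
    print([f.shape for f in adapted])
```

In a source-free setting along these lines, only the adapter parameters (and any lightweight prediction head) would be optimized on the few labeled target samples using VVEA/TVEA-style objectives, while the backbone and the text encoder remain frozen.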
Similar Papers
DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model
CV and Pattern Recognition
Teaches computers to recognize new things with few examples.
Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation
CV and Pattern Recognition
Helps computers learn new things with few examples.
Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining
CV and Pattern Recognition
Lets computers learn new things with less data.