Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
By: Zhixiang Chi, Yanan Wu, Li Gu, and more
Potential Business Impact:
Helps computers understand pictures and words better.
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention, yet this coherence is not consistently propagated to the final output because of subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed the output coherence cues back into the model. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
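To make the feedback idea concrete, below is a minimal PyTorch sketch of the core mechanism under simplifying assumptions: single-head attention over N patches, with the output patch embeddings and text embeddings already extracted. All names (output_feedback_attention, conf_thresh, blend) are hypothetical illustrations, not the authors' code, and the attention isolation and adaptation ensemble modules are omitted.

```python
import torch
import torch.nn.functional as F

def output_feedback_attention(attn, patch_feats, text_feats,
                              conf_thresh=0.5, blend=0.5):
    """Hedged sketch of output-to-attention feedback (hypothetical API).

    attn:        (N, N) intermediate attention over N patches
    patch_feats: (N, D) output patch embeddings from CLIP's final layer
    text_feats:  (C, D) text embeddings for C class prompts
    """
    # Patch-level class probabilities from the model's output: these carry
    # the richest joint visual-textual semantics available per patch.
    logits = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T
    probs = logits.softmax(dim=-1)                       # (N, C)

    # Confidence-based pruning: only confident patches contribute feedback
    # (sparse adaptation).
    conf, _ = probs.max(dim=-1)                          # (N,)
    keep = conf > conf_thresh                            # boolean row mask

    # Output-based patch-level correspondence (spatial coherence prior):
    # patches predicted to belong to the same class attend to each other.
    coherence = probs @ probs.T                          # (N, N)
    coherence = coherence / coherence.sum(-1, keepdim=True).clamp_min(1e-6)

    # Feed the coherence prior back into the intermediate attention,
    # only on rows that pass the confidence gate.
    adapted = attn.clone()
    adapted[keep] = (1 - blend) * attn[keep] + blend * coherence[keep]
    return adapted

# Toy usage with random tensors (shapes only; not the paper's pipeline).
N, D, C = 196, 512, 8
attn = torch.rand(N, N).softmax(dim=-1)
patch_feats, text_feats = torch.randn(N, D), torch.randn(C, D)
new_attn = output_feedback_attention(attn, patch_feats, text_feats)
print(new_attn.shape)  # torch.Size([196, 196])
```

In a full plug-in setting, the adapted attention would replace the original intermediate attention and the affected blocks would be re-run, which is what makes the scheme training-free.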
Similar Papers
InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer
CV and Pattern Recognition
Lets computers label picture parts with any words.
Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
CV and Pattern Recognition
Makes computers understand pictures better for tasks.
Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning
CV and Pattern Recognition
Helps computers learn new things without forgetting old ones.