AttriPrompt: Dynamic Prompt Composition Learning for CLIP
By: Qiqi Zhan, Shiwei Li, Qingjie Liu, and more
Potential Business Impact:
Helps computers understand pictures better.
The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment while neglecting fine-grained feature optimization; and (2) static prompts shared across all input categories, which prevent content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We design an Attribute Retrieval module that first clusters the visual features from each layer. The aggregated visual features then retrieve semantically similar prompts from a prompt pool, and the retrieved prompts are concatenated to the input of every layer in the text encoder. Leveraging the hierarchical visual information embedded in the prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism that applies explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, with up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
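The retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the similarity measure, the number of retrieved prompts, or the concatenation order, so the code below assumes mean pooling as a stand-in for clustering, cosine similarity for retrieval, a hypothetical `top_k` parameter, and prepending for concatenation. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(features: Sequence[Vector]) -> List[float]:
    """Average equal-length vectors.

    ASSUMPTION: a stand-in for the clustering/aggregation step, whose exact
    form is not given in the abstract.
    """
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompts(query: Vector, pool_keys: Sequence[Vector],
                     pool_prompts: Sequence[str], top_k: int) -> List[str]:
    """Return the top_k prompts whose keys are most similar to the query.

    ASSUMPTION: each prompt in the pool has an associated key vector, and
    similarity is cosine similarity.
    """
    scores = [cosine(query, key) for key in pool_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [pool_prompts[i] for i in ranked[:top_k]]


def compose_layer_input(text_tokens: Sequence[str],
                        retrieved: Sequence[str]) -> List[str]:
    """Concatenate retrieved prompts with one text-encoder layer's input.

    ASSUMPTION: prompts are prepended; the abstract only says "concatenated".
    """
    return list(retrieved) + list(text_tokens)


# Toy data: strings stand in for learnable prompt vectors and token embeddings.
patch_features = [[1.0, 0.0], [0.8, 0.2]]          # one vision layer's features
pool_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical prompt keys
pool_prompts = ["p_a", "p_b", "p_c"]

query = mean_pool(patch_features)
selected = retrieve_prompts(query, pool_keys, pool_prompts, top_k=2)
layer_input = compose_layer_input(["a", "photo"], selected)
print(selected)      # ['p_a', 'p_c']
print(layer_input)   # ['p_a', 'p_c', 'a', 'photo']
```

In this toy run the pooled query points mostly along the first axis, so the prompts keyed by `[1.0, 0.0]` and `[1.0, 1.0]` are selected. Under the description in the abstract, the same procedure would be repeated per layer, so that different images yield different prompts at each depth of the text encoder.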
Similar Papers
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
CV and Pattern Recognition
Helps AI remember old lessons when learning new ones.
VSC: Visual Search Compositional Text-to-Image Diffusion Model
CV and Pattern Recognition
Makes AI draw pictures with many details correctly.
Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
CV and Pattern Recognition
Finds exact pictures for text descriptions.