Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
By: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
Potential Business Impact:
Teaches computers to learn from few pictures.
Contrastive vision-language models excel at zero-shot image recognition but struggle in few-shot scenarios, where offline fine-tuning approaches such as prompt learning are computationally intensive and prone to overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, while maintaining efficient inference and scaling across CLIP backbones.
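To make the dual-adapter idea concrete, here is a minimal PyTorch sketch of the two components as described in the abstract: a Memory Attn-Adapter that refines frozen class (text) embeddings by cross-attending over support-example features, and a Local-Global Attn-Adapter that enriches the global image embedding by attending to local patch tokens. This is not the authors' implementation; module names, embedding dimensions, the residual mixing weight, and the logit scale are illustrative assumptions.

```python
# Sketch of the two adapters described in the Attn-Adapter abstract.
# All names, shapes, and the 0.5 residual weight are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryAttnAdapter(nn.Module):
    """Refine frozen CLIP class embeddings via cross-attention over support features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, class_emb: torch.Tensor, support_emb: torch.Tensor) -> torch.Tensor:
        # class_emb:   (num_classes, dim) frozen text embeddings (queries)
        # support_emb: (num_support, dim) features of the few labeled examples (keys/values)
        q = class_emb.unsqueeze(0)
        kv = support_emb.unsqueeze(0)
        refined, _ = self.attn(q, kv, kv)
        # Residual update preserves the zero-shot prior; 0.5 mixing is arbitrary here.
        return F.normalize(class_emb + 0.5 * refined.squeeze(0), dim=-1)


class LocalGlobalAttnAdapter(nn.Module):
    """Enrich the global image embedding by attending to local patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, global_emb: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        # global_emb:   (batch, dim)              pooled CLIP image embedding (query)
        # local_tokens: (batch, num_patches, dim) patch-level features (keys/values)
        q = global_emb.unsqueeze(1)
        enriched, _ = self.attn(q, local_tokens, local_tokens)
        return F.normalize(global_emb + 0.5 * enriched.squeeze(1), dim=-1)


if __name__ == "__main__":
    dim, n_cls, n_support, n_patch = 512, 10, 16, 49
    mem, lg = MemoryAttnAdapter(dim), LocalGlobalAttnAdapter(dim)

    class_emb = F.normalize(torch.randn(n_cls, dim), dim=-1)
    support_emb = F.normalize(torch.randn(n_support, dim), dim=-1)
    img_global = F.normalize(torch.randn(4, dim), dim=-1)
    img_local = torch.randn(4, n_patch, dim)

    refined_classes = mem(class_emb, support_emb)            # (10, 512)
    enriched_images = lg(img_global, img_local)              # (4, 512)
    logits = 100.0 * enriched_images @ refined_classes.t()   # CLIP-style cosine logits
    print(logits.shape)  # torch.Size([4, 10])
```

In this sketch only the two small attention adapters carry trainable parameters, so the frozen CLIP backbone never needs retraining and adaptation to a new support set amounts to a forward pass plus lightweight updates, consistent with the online few-shot setting the abstract describes.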
Similar Papers
Advancing Cache-Based Few-Shot Classification via Patch-Driven Relational Gated Graph Attention
CV and Pattern Recognition
Helps computers learn new things from few examples.
Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners
CV and Pattern Recognition
Helps AI learn from very few pictures.
Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
CV and Pattern Recognition
Helps computers understand pictures and words better.