Generalizing Vision-Language Models with Dedicated Prompt Guidance
By: Xinyao Li, Yinjie Min, Hongbo Chen, and more
Potential Business Impact:
Helps AI models recognize things accurately in new, unseen settings without retraining from scratch.
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
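Because the abstract lays out a concrete two-step recipe, a minimal sketch may help make it tangible: step one trains a lightweight prompt expert per partitioned source domain with the VLM frozen, and step two lets image features attend over the pooled expert prompts through a cross-modal attention module whose output serves as guidance for tuning the vision encoder. The class names, tensor shapes, and the use of `nn.MultiheadAttention` below are illustrative assumptions under a CLIP-style backbone, not the authors' implementation.

```python
# Minimal sketch of the two-step GuiDG idea described in the abstract.
# All names and shapes are illustrative assumptions, not the released code.

import torch
import torch.nn as nn


class PromptExpert(nn.Module):
    """Step 1: learnable text-prompt context for one source domain."""

    def __init__(self, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Only the prompt context is trained; the VLM encoders stay frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self) -> torch.Tensor:
        return self.ctx  # (n_ctx, dim), consumed by the frozen text encoder


class CrossModalAttention(nn.Module):
    """Step 2: image features attend over the pool of domain-expert prompts,
    producing adaptive guidance for fine-tuning the vision encoder."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, img_feat: torch.Tensor, expert_prompts: torch.Tensor):
        # img_feat: (B, 1, dim); expert_prompts: (B, n_experts * n_ctx, dim)
        guided, weights = self.attn(img_feat, expert_prompts, expert_prompts)
        return guided.squeeze(1), weights  # integrated guidance + expert weights


if __name__ == "__main__":
    dim, n_ctx, n_experts, batch = 512, 16, 3, 4
    experts = [PromptExpert(n_ctx, dim) for _ in range(n_experts)]

    # Stand-in for image features from the (partially tuned) vision encoder.
    img_feat = torch.randn(batch, 1, dim)
    prompts = torch.cat([e() for e in experts], dim=0)       # (n_experts*n_ctx, dim)
    prompts = prompts.unsqueeze(0).expand(batch, -1, -1)     # (B, n_experts*n_ctx, dim)

    guidance, attn_w = CrossModalAttention(dim)(img_feat, prompts)
    print(guidance.shape)  # torch.Size([4, 512])
```

In this reading, the attention weights indicate how much each domain expert should contribute for a given image, which is one plausible way to realize the "adaptive expert integration" the abstract describes.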
Similar Papers
Infusing fine-grained visual knowledge to Vision-Language Models
CV and Pattern Recognition
Teaches vision-language models fine-grained visual details without losing their general knowledge.
Visual Generation Tuning
CV and Pattern Recognition
Tunes pretrained AI models so they can also generate images.
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
CV and Pattern Recognition
Helps segmentation models recognize objects in unfamiliar environments.