VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
By: Silin Cheng, Kai Han
Potential Business Impact:
Teaches computers to understand pictures and words better.
Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further integrate local and global semantics, we introduce a class-aware prior derived from the instance representation and the class prototype. Building on these components, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure. Project page: https://visual-ai.github.io/vamp
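To make the reparameterized sampling described in the abstract concrete, below is a minimal PyTorch sketch of instance-conditioned variational prompt generation. It is not the paper's implementation: the module and function names (VariationalPromptSampler, kl_to_class_prior), the dimensions (feat_dim=512, prompt_len=4), and the unit-variance class-aware prior are illustrative assumptions; the actual VaMP architecture and prior construction are defined in the paper.

```python
import torch
import torch.nn as nn


class VariationalPromptSampler(nn.Module):
    """Sketch of instance-conditioned variational prompt sampling.

    Given an image feature, a small posterior network predicts the mean and
    log-variance of a Gaussian over a latent prompt, which is then sampled
    with the reparameterization trick so gradients flow through the sample.
    """

    def __init__(self, feat_dim=512, prompt_len=4, prompt_dim=512):
        super().__init__()
        self.prompt_len = prompt_len
        self.prompt_dim = prompt_dim
        # Posterior network q(z | x): image feature -> (mu, log_var)
        self.posterior = nn.Linear(feat_dim, 2 * prompt_len * prompt_dim)

    def forward(self, image_feat):
        # image_feat: (batch, feat_dim)
        stats = self.posterior(image_feat)
        mu, log_var = stats.chunk(2, dim=-1)
        # Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        prompts = z.view(-1, self.prompt_len, self.prompt_dim)
        return prompts, mu, log_var


def kl_to_class_prior(mu, log_var, prior_mu):
    """KL(q(z|x) || p(z|class)) between the Gaussian posterior and a
    class-aware Gaussian prior with unit variance (an assumption here;
    the paper derives its prior from instance and prototype features)."""
    return 0.5 * torch.sum(
        torch.exp(log_var) + (mu - prior_mu) ** 2 - 1.0 - log_var, dim=-1
    ).mean()


if __name__ == "__main__":
    sampler = VariationalPromptSampler()
    feats = torch.randn(8, 512)          # stand-in for CLIP image features
    prompts, mu, log_var = sampler(feats)
    prior_mu = torch.zeros_like(mu)      # placeholder class-aware prior mean
    kl = kl_to_class_prior(mu, log_var, prior_mu)
    print(prompts.shape, kl.item())      # (8, 4, 512) prompt tokens per image
```

In this sketch the KL term would be added to the task loss as a regularizer, so training jointly fits the sampled prompts to the downstream objective while keeping the posterior close to the class-aware prior.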
Similar Papers
Modeling Variants of Prompts for Vision-Language Models
CV and Pattern Recognition
Helps AI understand pictures no matter how the prompt words are phrased.
Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift
CV and Pattern Recognition
Helps AI understand new things with few examples.
Medical Knowledge Intervention Prompt Tuning for Medical Image Classification
CV and Pattern Recognition
Helps AI understand medical images better.