From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts
By: Weiran Li, Yeqiang Liu, Yijie Wei and more
Potential Business Impact:
Teaches computers to understand new things better.
Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Vision-Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
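To make the dual denoising idea more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes a linear decay for the annealed noise scale, isotropic Gaussian noise on the text prompt, a toy linear map standing in for the V-L Mapper, and a mean-squared-error reconstruction loss whose "clean visual prompt" target is simply the mapper output on the unperturbed text prompt. All function names and default values (`annealed_sigma`, `perturb`, `linear_mapper`, `denoising_loss`, `sigma_start`, `sigma_end`) are hypothetical. The `harmonic_mean` helper shows how base and novel accuracy are conventionally combined into the Harmonic Mean reported on base-to-novel benchmarks.

```python
import random


def annealed_sigma(step, total_steps, sigma_start=0.1, sigma_end=0.0):
    """Noise scale at a training step (assumed schedule: linear decay)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)


def perturb(prompt, sigma, rng):
    """Add zero-mean Gaussian noise with std `sigma` to every coordinate."""
    return [x + rng.gauss(0.0, sigma) for x in prompt]


def linear_mapper(weights, text_prompt):
    """Toy stand-in for a V-L mapper: a plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, text_prompt)) for row in weights]


def denoising_loss(weights, clean_text, sigma, rng):
    """MSE between the mapper output on noisy text and on clean text.

    Assumption: the clean visual prompt is the mapper applied to the
    unperturbed text prompt.
    """
    target = linear_mapper(weights, clean_text)
    pred = linear_mapper(weights, perturb(clean_text, sigma, rng))
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]  # hypothetical 3 -> 2 mapper
    text_prompt = [1.0, 2.0, 3.0]                   # hypothetical text prompt
    total_steps = 5
    for step in range(total_steps):
        sigma = annealed_sigma(step, total_steps)
        loss = denoising_loss(weights, text_prompt, sigma, rng)
        print(f"step={step} sigma={sigma:.3f} denoising_loss={loss:.6f}")
```

In a real training loop, a term like this would presumably be added to the usual classification objective with some weighting, and the noise would shrink as training proceeds so that early steps explore a wider neighborhood of the prompt embedding than later ones.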