VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models
By: Dominick Reilly, Manish Kumar Govind, Le Xue, and more
Potential Business Impact:
Helps AI understand new things without forgetting old ones.
Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.
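To make the idea of "learnable visual probes" concrete, here is a minimal PyTorch sketch of the general pattern the abstract describes: a frozen pretrained vision encoder whose output token sequence is augmented with a small set of trainable probe tokens. The wrapper class, probe count, and embedding dimension are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ProbedVisionEncoder(nn.Module):
    """Frozen vision encoder plus a compact set of learnable probe tokens (illustrative sketch)."""

    def __init__(self, vision_encoder: nn.Module, embed_dim: int = 1024, num_probes: int = 16):
        super().__init__()
        self.encoder = vision_encoder
        # Keep the pretrained encoder parameters fixed; only the probes are trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # A small bank of learnable probe embeddings (1, num_probes, embed_dim).
        self.probes = nn.Parameter(torch.randn(1, num_probes, embed_dim) * 0.02)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Assumes the encoder returns a (batch, num_tokens, embed_dim) token sequence.
        tokens = self.encoder(images)
        probes = self.probes.expand(tokens.size(0), -1, -1)
        # Append the probe tokens to the visual tokens before passing them to the LLM.
        return torch.cat([tokens, probes], dim=1)

During adaptation, only the probe parameters (a few thousand to a few million weights, depending on the configuration) receive gradients, which is what keeps modification of the pretrained model minimal and helps preserve source-domain behavior.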
Similar Papers
Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models
CV and Pattern Recognition
Helps computers understand new pictures better.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
CV and Pattern Recognition
Lets computers see smarter, using less data.
Language-Guided Invariance Probing of Vision-Language Models
CV and Pattern Recognition
Tests if AI understands words that mean the same thing.