Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
By: Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi
Potential Business Impact:
Helps AI focus on objects, not just backgrounds.
Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose combining CCI with GroundedSAM to automatically categorize predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.
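The cluster-mask-rescore loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (kmeans, make_pooled_score, cluster_importance), the use of plain k-means, the choice of k, and the toy scoring function (cosine similarity between mean-pooled kept patches and a text embedding, standing in for re-running CLIP on a masked image) are all assumptions made for illustration only.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed clustering choice); one label per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every patch to every center: (n, k).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels


def make_pooled_score(patch_emb, text_emb):
    """Hypothetical stand-in for the model: cosine similarity between the
    mean of the kept patch embeddings and the text embedding."""
    text_unit = text_emb / np.linalg.norm(text_emb)

    def score(keep):
        if not keep.any():  # everything masked: define the score as 0
            return 0.0
        pooled = patch_emb[keep].mean(axis=0)
        return float(pooled @ text_unit / (np.linalg.norm(pooled) + 1e-8))

    return score


def cluster_importance(patch_emb, score_fn, k=4):
    """Mask one cluster at a time and record the relative drop in the score."""
    labels = kmeans(patch_emb, k)
    full = score_fn(np.ones(len(patch_emb), dtype=bool))
    importance = np.zeros(k)
    for j in range(k):
        keep = labels != j  # boolean mask that removes cluster j
        importance[j] = (full - score_fn(keep)) / (abs(full) + 1e-8)
    return labels, importance
```

A cluster with a large positive importance is one whose removal hurts the prediction most. In a real pipeline, score_fn would re-encode the image with that cluster's patches masked rather than re-pool precomputed embeddings.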
Similar Papers
Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
CV and Pattern Recognition
Helps computers understand pictures and where things are.
Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
CV and Pattern Recognition
Helps computers understand pictures and words better.
Language-Guided Invariance Probing of Vision-Language Models
CV and Pattern Recognition
Tests if AI understands words that mean the same thing.