Unlocking Text Capabilities in Vision Models
By: Fawaz Sammani, Jonas Fischer, Nikos Deligiannis
Potential Business Impact:
Lets computers explain what pictures show.
Visual classifiers provide high-dimensional feature representations that are challenging to interpret and analyze. Text, in contrast, provides a more expressive and human-friendly medium for interpreting and analyzing model behavior. We propose a simple yet powerful method for reformulating any pretrained visual classifier so that it can be queried with free-form text without compromising its original performance. Our approach is label-free, data- and compute-efficient, and is trained to preserve the underlying classifier's distribution and decision-making process. Our method unlocks several zero-shot text-based interpretability applications for any visual classifier. We apply our method to 40 visual classifiers and demonstrate two primary applications: 1) building both label-free and zero-shot concept bottleneck models, thereby converting any visual classifier into an inherently interpretable one, and 2) zero-shot decoding of visual features into natural language sentences. On both tasks we establish new state-of-the-art results, outperforming existing work and surpassing CLIP-based baselines with ImageNet-only trained classifiers, while using up to 400x fewer images and 400,000x less text during training.
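To make the idea concrete, here is a minimal sketch of one plausible reading of the abstract: a lightweight learned projection maps the frozen classifier's features into a shared text-embedding space (e.g., a CLIP text encoder's), and a label-free distillation objective preserves the classifier's output distribution. All names (TextQueryAdapter, distillation_loss) and the exact objective are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextQueryAdapter(nn.Module):
    """Hypothetical adapter: projects a frozen classifier's features into a
    text-embedding space so the classifier can be queried with free-form
    text. The single linear layer reflects the abstract's emphasis on data
    and compute efficiency; the actual architecture is an assumption."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity with text embeddings is meaningful.
        return F.normalize(self.proj(visual_feats), dim=-1)

def text_query_scores(adapter, visual_feats, text_embeds):
    """Score free-form text queries against classifier features."""
    v = adapter(visual_feats)             # (B, text_dim)
    t = F.normalize(text_embeds, dim=-1)  # (Q, text_dim)
    return v @ t.T                        # (B, Q) cosine similarities

def distillation_loss(adapter, visual_feats, class_text_embeds,
                      teacher_logits, tau=2.0):
    """Label-free objective (assumed): match the original classifier's
    softmax distribution using similarities to class-name text embeddings,
    so the reformulated model preserves the classifier's decisions."""
    student_logits = text_query_scores(adapter, visual_feats,
                                       class_text_embeds) / tau
    teacher_probs = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                    reduction="batchmean") * tau ** 2

if __name__ == "__main__":
    # Random tensors stand in for real encoders (e.g., ResNet-50 features
    # with feat_dim=2048 and CLIP text embeddings with text_dim=512).
    adapter = TextQueryAdapter(feat_dim=2048, text_dim=512)
    feats = torch.randn(4, 2048)     # frozen classifier features
    queries = torch.randn(3, 512)    # text embeddings of free-form queries
    print(text_query_scores(adapter, feats, queries).shape)  # torch.Size([4, 3])

Once trained, the same projected features can serve both advertised applications: scoring concept texts yields a concept bottleneck layer, and scoring candidate sentences supports zero-shot decoding into natural language.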
Similar Papers
Zero-Shot Textual Explanations via Translating Decision-Critical Features
CV and Pattern Recognition
Explains why computers see what they see.
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
CV and Pattern Recognition
Finds pictures using only words, not images.
Unifying Vision-Language Latents for Zero-label Image Caption Enhancement
CV and Pattern Recognition
Helps computers describe pictures without seeing labels.