Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
By: Daniil Ignatev, Ayman Santeer, Albert Gatt, and more
Potential Business Impact:
Makes computers understand words better by looking at pictures.
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
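To make the described pipeline concrete, below is a minimal sketch of the grounding-then-comparing idea under assumed components: a Stable Diffusion text-to-image model to render the premise as an image, and CLIP cosine similarity between that image and the hypothesis as the inference signal (the first of the two techniques mentioned above). The model checkpoints, the entailment threshold, and the helper function are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch: zero-shot NLI by grounding the premise in a generated image
# and scoring the hypothesis against it with CLIP cosine similarity.
# Checkpoints and the threshold are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image model used to "ground" the premise in a visual scene.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

# CLIP scores how well the hypothesis text matches the generated image.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def premise_supports_hypothesis(premise: str, hypothesis: str,
                                threshold: float = 0.25) -> bool:
    """Generate an image for the premise, then compare it with the hypothesis
    via cosine similarity of CLIP embeddings. The threshold is an assumed,
    untuned value, not one reported in the paper."""
    image = t2i(premise).images[0]

    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        similarity = (img_emb @ txt_emb.T).item()

    return similarity >= threshold


# Example usage: a premise whose generated image should visually support
# the hypothesis.
# premise_supports_hypothesis("A dog runs across a grassy field",
#                             "An animal is outdoors")
```

The visual question answering variant mentioned in the abstract would replace the cosine-similarity step with a VQA model queried about the generated image; that path is not sketched here.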
Similar Papers
Towards Understanding Visual Grounding in Visual Language Models
CV and Pattern Recognition
Helps computers understand what's in pictures.
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
CV and Pattern Recognition
Finds things in satellite pictures using words.