Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
By: Daniil Ignatev, Ayman Santeer, Albert Gatt, and more
Potential Business Impact:
Makes computers understand words better by looking at pictures.
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
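To make the described pipeline concrete, below is a minimal sketch of the grounding-then-comparing idea under assumed components: a Stable Diffusion text-to-image model to render the premise as an image, and CLIP cosine similarity between that image and the hypothesis as the inference signal (the first of the two techniques mentioned above). The model checkpoints, the entailment threshold, and the helper function are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch: zero-shot NLI by grounding the premise in a generated image
# and scoring the hypothesis against it with CLIP cosine similarity.
# Checkpoints and the threshold are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image model used to "ground" the premise in a visual scene.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

# CLIP scores how well the hypothesis text matches the generated image.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def premise_supports_hypothesis(premise: str, hypothesis: str,
                                threshold: float = 0.25) -> bool:
    """Generate an image for the premise, then compare it with the hypothesis
    via cosine similarity of CLIP embeddings. The threshold is an assumed,
    untuned value, not one reported in the paper."""
    image = t2i(premise).images[0]

    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        similarity = (img_emb @ txt_emb.T).item()

    return similarity >= threshold


# Example usage: a premise whose generated image should visually support
# the hypothesis.
# premise_supports_hypothesis("A dog runs across a grassy field",
#                             "An animal is outdoors")
```

The visual question answering variant mentioned in the abstract would replace the cosine-similarity step with a VQA model queried about the generated image; that path is not sketched here.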
Similar Papers
Towards Understanding Visual Grounding in Visual Language Models
CV and Pattern Recognition
Helps computers understand what's in pictures.
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
CV and Pattern Recognition
Finds things in satellite pictures using words.