Measuring How (Not Just Whether) VLMs Build Common Ground

Published: September 4, 2025 | arXiv ID: 2509.03805v1

By: Saki Imai, Mert İnan, Anthony Sicilia, and others

Potential Business Impact:

Tests how well AI understands and talks about pictures.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large vision-language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question-answering settings. Grounding, however, is an interactive process in which people gradually build shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, with GPT-4o-mini the closest overall. We find that (i) task-success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
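To make the idea of a dialogue-level metric concrete, the sketch below computes one plausible lexical-adaptation measure: the fraction of a speaker's words that were previously introduced by their partner, a common way to quantify lexical convergence in referential games. This is an illustrative assumption, not the paper's actual metric definition; the `lexical_adaptation` function, its tokenization, and the toy dialogue are all hypothetical.

```python
# Hypothetical lexical-adaptation sketch: how often does a speaker reuse
# words that the *partner* introduced earlier in the dialogue?
# This is NOT the paper's metric, only an illustrative convergence measure.

def lexical_adaptation(turns):
    """turns: ordered list of (speaker, utterance) pairs, speakers 'A'/'B'.

    Returns the fraction of word tokens (deduplicated per utterance) that
    already appeared in the partner's earlier utterances.
    """
    seen = {"A": set(), "B": set()}  # words each speaker has used so far
    adopted, total = 0, 0
    for speaker, utterance in turns:
        words = set(utterance.lower().split())  # naive whitespace tokenizer
        partner = "B" if speaker == "A" else "A"
        adopted += len(words & seen[partner])   # words borrowed from partner
        total += len(words)
        seen[speaker] |= words
    return adopted / total if total else 0.0

# Toy referential-game exchange (invented for illustration)
dialogue = [
    ("A", "the one with the striped scarf"),
    ("B", "striped scarf yes got it"),
    ("A", "next is the tall striped hat"),
]
score = lexical_adaptation(dialogue)  # 3 of 16 tokens are borrowed
```

Comparing such a score between VLM self-play transcripts and human dyads is one way a divergence from human adaptation patterns could be surfaced.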

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
12 pages

Category
Computer Science:
Computation and Language