Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
By: Ziqiao Ma, Jing Ding, Xuejun Zhang and more
Potential Business Impact:
Helps computers describe pictures like people do.
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preferences, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
Similar Papers
Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
Computation and Language
Helps computers understand where things are.
Vision language models are unreliable at trivial spatial cognition
CV and Pattern Recognition
Computers struggle to tell what's left or right.
LVLMs are Bad at Overhearing Human Referential Communication
Computation and Language
Computers learn to understand what people are talking about.