When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
By: Rhea Kapur, Robert Hawkins, Elisa Kreiss
Potential Business Impact:
Makes picture descriptions say more with fewer words.
Vision-language models (VLMs) are increasingly used to make visual content accessible through text descriptions. In current systems, however, a description's specificity is often conflated with its length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set: a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity; how the length budget is allocated matters. These results support evaluation approaches that directly prioritize specificity over verbosity.
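The contrast-set definition lends itself to a simple retrieval-style measure. Below is a minimal sketch, not the authors' implementation, that scores a description by the probability it retrieves the target image from a contrast set, using CLIP-style image-text similarities; the choice of CLIP and the softmax formulation are illustrative assumptions.

```python
# Sketch of a contrast-set specificity score (assumed formulation, not the
# paper's method): how well does a description pick out the target image
# from a set of distractor images?

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def specificity(description: str,
                target: Image.Image,
                distractors: list[Image.Image]) -> float:
    """Probability that `description` retrieves `target` from the
    contrast set (target plus distractors). Higher = more specific."""
    images = [target] + distractors
    inputs = processor(text=[description], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text: (1, n_images) similarity scores for the description
    probs = out.logits_per_text.softmax(dim=-1)
    return probs[0, 0].item()  # index 0 is the target image
```

Under a measure like this, a short but discriminative description can outscore a long but generic one, which is exactly the length/specificity dissociation the paper evaluates.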
Similar Papers
An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
CV and Pattern Recognition
Makes AI talk too much, wasting time and money.
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
CV and Pattern Recognition
Teaches computers to understand long picture descriptions.
Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor
CV and Pattern Recognition
Helps computers understand pictures better with words.