Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
By: Sara Sarto, Marcella Cornia, Rita Cucchiara
Potential Business Impact:
Helps computers describe pictures more accurately.
The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
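As an illustration of one evaluation dimension the survey discusses, the sketch below computes a rank correlation (Kendall's tau) between an automatic metric's scores and human ratings for the same captions. The scores, ratings, and variable names are hypothetical placeholders, not data or code from the paper.

```python
# Minimal sketch: checking how well a captioning metric agrees with human judgment.
# All numbers below are hypothetical placeholders for illustration only.
from scipy.stats import kendalltau

# Hypothetical automatic-metric scores and human ratings for five candidate captions.
metric_scores = [0.72, 0.55, 0.91, 0.30, 0.64]
human_ratings = [4.0, 3.5, 4.5, 2.0, 3.0]

# Kendall's tau is a standard rank-correlation statistic used when comparing
# automatic metrics against human judgments.
tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall tau: {tau:.3f} (p = {p_value:.3f})")
```

A higher tau indicates that the metric ranks captions in roughly the same order as human annotators, which is one of the criteria the survey uses to compare metrics.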
Similar Papers
Multi-LLM Collaborative Caption Generation in Scientific Documents
Computation and Language
Helps computers write better captions for scientific papers.
A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling
CV and Pattern Recognition
Helps computers understand pictures from different cultures.
LLM-Free Image Captioning Evaluation in Reference-Flexible Settings
CV and Pattern Recognition
Helps computers judge picture descriptions better.