One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
By: Junha Song , Yongsik Jo , So Yeon Min and more
Potential Business Impact:
Lets phones describe pictures without the internet.
Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.
Similar Papers
Generating Accurate and Detailed Captions for High-Resolution Images
CV and Pattern Recognition
Makes computer pictures describe more details accurately.
Large Language Models Facilitate Vision Reflection in Image Classification
CV and Pattern Recognition
Helps AI understand pictures by using words.
Multilingual Training-Free Remote Sensing Image Captioning
CV and Pattern Recognition
Lets computers describe satellite pictures in any language.