Top-Down Semantic Refinement for Image Captioning
By: Jusheng Zhang, Kaitong Cai, Jing Yang and more
Potential Business Impact:
Makes AI describe pictures with more detail.
Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visually guided parallel expansion and a lightweight value network, our TDSR reduces the number of calls to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL), achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
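The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: a tree search over partial captions in which children are expanded in one batch, leaves are scored by a cheap value function rather than a full rollout, and the search stops once the best score stops improving. It is not the authors' algorithm. The names `search`, `propose`, `value`, `patience`, and `max_depth` are hypothetical; `propose` stands in for a VLM call that suggests next details, `value` stands in for a lightweight value network, and selection uses plain UCT.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # tuple of caption fragments chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total = 0.0        # sum of backed-up value estimates

    def uct(self, c=1.4):
        # Unvisited children are tried first.
        if self.visits == 0:
            return float("inf")
        mean = self.total / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def search(root_state, propose, value, max_iters=200, patience=20, max_depth=4):
    """Toy MCTS. propose/value are stand-ins (assumptions), not a real VLM API."""
    root = Node(root_state)
    best_value, best_state, stale = float("-inf"), root_state, 0
    iters = 0
    for iters in range(1, max_iters + 1):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: create all proposed children in one batch.
        if len(node.state) < max_depth:
            node.children = [Node(node.state + (a,), node)
                             for a in propose(node.state)]
            if node.children:
                node = node.children[0]
        # 3. Evaluation: a cheap value estimate replaces a full rollout.
        v = value(node.state)
        # 4. Backup: propagate the estimate to the root.
        cur = node
        while cur is not None:
            cur.visits += 1
            cur.total += v
            cur = cur.parent
        # 5. Early stopping: quit after `patience` iterations without improvement.
        if v > best_value + 1e-9:
            best_value, best_state, stale = v, node.state, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_state, best_value, iters

# Toy "image": three objects are present; "unicorn" would be a hallucination.
OBJECTS = ["dog", "ball", "grass", "unicorn"]
PRESENT = {"dog", "ball", "grass"}

def propose(state):
    # Stand-in for the VLM: suggest any object not yet mentioned.
    return [o for o in OBJECTS if o not in state]

def value(state):
    # Stand-in for the value network: reward coverage, penalize hallucination.
    hits = sum(1 for o in state if o in PRESENT)
    return (hits - (len(state) - hits)) / len(PRESENT)

state, score, used = search((), propose, value, max_depth=3)
print(state, score, used)
```

In this toy setting the early-stopping rule ends the search well before `max_iters`, which loosely mirrors the idea of spending less computation on simpler images; how TDSR actually measures image complexity is not stated in the abstract.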
Similar Papers
STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
CV and Pattern Recognition
Helps self-driving cars understand traffic better.
Feedback-Driven Vision-Language Alignment with Minimal Human Supervision
CV and Pattern Recognition
Makes AI understand pictures better with less work.
Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
CV and Pattern Recognition
Lets computers see and think better.