Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
By: Xisheng Feng
Potential Business Impact:
Helps computers identify plants by looking, not guessing.
Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination," where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.6% over Qwen-VL and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval.
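Below is a minimal sketch of how the three-stage Look, Recite, Then Answer pipeline might be wired, assuming the frozen VLM, the 1.7B router, and the consistency scorer are all exposed as simple text-in/text-out callables. The function names, prompts, and stub models here are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable, Dict, List, Tuple

def look(vlm: Callable[[str, bytes], str], image: bytes) -> Tuple[str, List[str]]:
    """Stage 1 (Look): ask the frozen VLM for an objective visual description
    and a candidate label set, without committing to a final answer."""
    description = vlm("Describe the plant's visible traits objectively.", image)
    raw = vlm("List plausible candidate species, comma-separated.", image)
    return description, [c.strip() for c in raw.split(",") if c.strip()]

def recite(router: Callable[[str, str], str],
           llm: Callable[[str], str],
           description: str,
           candidates: List[str]) -> Dict[str, str]:
    """Stage 2 (Recite): the lightweight router turns visual cues into a
    targeted query per candidate; the frozen model then answers each query
    from its own parametric knowledge (no external search)."""
    return {c: llm(router(description, c)) for c in candidates}

def answer(score: Callable[[str, str], float],
           description: str,
           recited: Dict[str, str]) -> str:
    """Stage 3 (Answer): align the visual description against each recited
    knowledge snippet in parallel and return the most consistent label."""
    return max(recited, key=lambda c: score(description, recited[c]))

# Toy usage with stub callables; real use would plug in the frozen VLM and 1.7B router.
if __name__ == "__main__":
    stub_vlm = lambda prompt, img: ("narrow blades, membranous ligule" if "Describe" in prompt
                                    else "annual ryegrass, large crabgrass")
    stub_router = lambda desc, cand: f"What leaf and ligule traits distinguish {cand}?"
    stub_llm = lambda q: ("membranous ligule, narrow blades" if "ryegrass" in q
                          else "hairy sheath, wide blades")
    word_overlap = lambda a, b: len(set(a.lower().split()) & set(b.lower().split()))
    desc, cands = look(stub_vlm, b"")
    print(answer(word_overlap, desc, recite(stub_router, stub_llm, desc, cands)))
```

The design choice worth noting is that each stage only exchanges text, so the backbone stays frozen and the router can be swapped or fine-tuned independently.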
Similar Papers
VLMs Guided Interpretable Decision Making for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars make safer, clearer choices.
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models
Artificial Intelligence
Makes AI understand pictures and facts better.
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
Machine Learning (CS)
Helps AI remember facts from pictures faster.