Journey Before Destination: On the Importance of Visual Faithfulness in Slow Thinking
By: Rheeya Uppaal, Phu Mon Htut, Min Bai, and more
Potential Business Impact:
Catches and fixes an AI's visual reasoning mistakes for more trustworthy answers.
Reasoning-augmented vision-language models (VLMs) generate explicit chains of thought that promise greater capability and transparency, but they also introduce new failure modes: a model may reach the correct answer via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that measure only final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception and reasoning steps and uses off-the-shelf VLM judges to score step-level faithfulness, validating the approach with a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces the Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
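The abstract describes the pipeline only at a high level, so the sketch below is one rough illustration of how the detect-and-regenerate loop could fit together. Every name here (`classify_step`, `judge_faithfulness`, `regenerate_step`, the keyword heuristic, the always-faithful placeholder judge) is a hypothetical stand-in for illustration, not the authors' implementation; in practice each stub would prompt an off-the-shelf VLM with the image and the step text.

```python
from typing import List

# Placeholder cue list for the step classifier below; the paper's decomposition
# is assumed to use a VLM judge, not keywords.
PERCEPTION_CUES = ("the image shows", "i can see", "in the picture", "visible")

def classify_step(step: str) -> str:
    """Label a step 'perception' (a claim about the image) or 'reasoning'
    (an inference over earlier steps). Naive keyword heuristic as a stand-in."""
    return "perception" if any(c in step.lower() for c in PERCEPTION_CUES) else "reasoning"

def judge_faithfulness(image, step: str) -> bool:
    """Return True if a judge finds the perception step grounded in the image.
    Placeholder: always faithful; a real VLM judge call would go here."""
    return True

def regenerate_step(image, prefix: List[str], step: str) -> str:
    """Locally re-generate one unfaithful perception step, conditioned on the
    image and the chain so far (no training). Placeholder: identity."""
    return step

def self_reflect(image, chain: List[str]) -> List[str]:
    """Training-free self-reflection: re-check each perception step against the
    image and regenerate only the unfaithful ones; reasoning steps pass through."""
    repaired: List[str] = []
    for step in chain:
        if classify_step(step) == "perception" and not judge_faithfulness(image, step):
            step = regenerate_step(image, repaired, step)
        repaired.append(step)
    return repaired

def unfaithful_perception_rate(image, chains: List[List[str]]) -> float:
    """UPR: fraction of perception steps the judge marks as ungrounded."""
    total = sum(1 for c in chains for s in c if classify_step(s) == "perception")
    bad = sum(1 for c in chains for s in c
              if classify_step(s) == "perception" and not judge_faithfulness(image, s))
    return bad / total if total else 0.0
```

With real VLM judges substituted for the placeholders, `unfaithful_perception_rate` would play the role of the evaluation metric and `self_reflect` the role of the training-free repair step; the key design point the abstract implies is that repair is local (one step at a time) rather than a full regeneration of the chain.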
Similar Papers
On the Faithfulness of Visual Thinking: Measurement and Enhancement
CV and Pattern Recognition
Makes AI understand pictures better for answers.
VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
CV and Pattern Recognition
Checks if AI truly sees what it's told.
CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding
CV and Pattern Recognition
Helps computers imagine "what if" in videos.