Journey Before Destination: On the Importance of Visual Faithfulness in Slow Thinking
By: Rheeya Uppaal, Phu Mon Htut, Min Bai, and more
Potential Business Impact:
Catches and fixes an AI's visual reasoning mistakes for more trustworthy answers.
Reasoning-augmented vision-language models (VLMs) generate explicit chains of thought that promise greater capability and transparency, but they also introduce new failure modes: a model may reach the correct answer via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that measure only final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception and reasoning steps and uses off-the-shelf VLM judges to score step-level faithfulness, validating the approach with a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces the Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
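The abstract describes the pipeline only at a high level, so the sketch below is one rough illustration of how the detect-and-regenerate loop could fit together. Every name here (`classify_step`, `judge_faithfulness`, `regenerate_step`, the keyword heuristic, the always-faithful placeholder judge) is a hypothetical stand-in for illustration, not the authors' implementation; in practice each stub would prompt an off-the-shelf VLM with the image and the step text.

```python
from typing import List

# Placeholder cue list for the step classifier below; the paper's decomposition
# is assumed to use a VLM judge, not keywords.
PERCEPTION_CUES = ("the image shows", "i can see", "in the picture", "visible")

def classify_step(step: str) -> str:
    """Label a step 'perception' (a claim about the image) or 'reasoning'
    (an inference over earlier steps). Naive keyword heuristic as a stand-in."""
    return "perception" if any(c in step.lower() for c in PERCEPTION_CUES) else "reasoning"

def judge_faithfulness(image, step: str) -> bool:
    """Return True if a judge finds the perception step grounded in the image.
    Placeholder: always faithful; a real VLM judge call would go here."""
    return True

def regenerate_step(image, prefix: List[str], step: str) -> str:
    """Locally re-generate one unfaithful perception step, conditioned on the
    image and the chain so far (no training). Placeholder: identity."""
    return step

def self_reflect(image, chain: List[str]) -> List[str]:
    """Training-free self-reflection: re-check each perception step against the
    image and regenerate only the unfaithful ones; reasoning steps pass through."""
    repaired: List[str] = []
    for step in chain:
        if classify_step(step) == "perception" and not judge_faithfulness(image, step):
            step = regenerate_step(image, repaired, step)
        repaired.append(step)
    return repaired

def unfaithful_perception_rate(image, chains: List[List[str]]) -> float:
    """UPR: fraction of perception steps the judge marks as ungrounded."""
    total = sum(1 for c in chains for s in c if classify_step(s) == "perception")
    bad = sum(1 for c in chains for s in c
              if classify_step(s) == "perception" and not judge_faithfulness(image, s))
    return bad / total if total else 0.0
```

With real VLM judges substituted for the placeholders, `unfaithful_perception_rate` would play the role of the evaluation metric and `self_reflect` the role of the training-free repair step; the key design point the abstract implies is that repair is local (one step at a time) rather than a full regeneration of the chain.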
Similar Papers
On the Faithfulness of Visual Thinking: Measurement and Enhancement
CV and Pattern Recognition
Makes AI understand pictures better for answers.
VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
CV and Pattern Recognition
Checks if AI truly sees what it's told.
CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding
CV and Pattern Recognition
Helps computers imagine "what if" in videos.