Score: 2

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Published: September 25, 2025 | arXiv ID: 2509.20912v1

By: Tianrun Xu, Haoda Jing, Ye Li, and more

Potential Business Impact:

Teaches AI to explain answers using only real picture parts.

Business Areas:

Visual Search Internet Services

Multimodal language models (MLLMs) have made remarkable progress in vision-language reasoning, especially with the emergence of "thinking with images," which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, such flawed reasoning indicates that the model has not truly understood the image, which makes reasoning fidelity critical in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. Its key component is a set of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, using three complementary rewards that guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.
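
The abstract does not spell out how the three image variants are built or how the rewards feed into GRPO, so the following is only a rough sketch under stated assumptions, not the authors' pipeline. It assumes that an image is a grid of pixel values, that the localized evidence is a single box `(top, left, bottom, right)`, that the counterfactual variant masks the evidence box, and that the random variant masks an equally sized box that does not overlap the evidence. All function names are hypothetical. The last function shows the group-relative advantage normalization commonly used in GRPO (reward minus group mean, divided by group standard deviation); population standard deviation is used here, and implementations differ on that detail.

```python
import random
import statistics


def mask_box(image, box, fill=0):
    """Return a copy of `image` (list of rows) with the cells in `box` set to `fill`."""
    top, left, bottom, right = box
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


def boxes_overlap(a, b):
    """True if two (top, left, bottom, right) boxes share at least one cell."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def random_box_outside(height, width, evidence, size, rng, max_tries=1000):
    """Sample a box of `size` (h, w) that does not overlap `evidence`."""
    h, w = size
    for _ in range(max_tries):
        top = rng.randrange(0, height - h + 1)
        left = rng.randrange(0, width - w + 1)
        box = (top, left, top + h, left + w)
        if not boxes_overlap(box, evidence):
            return box
    raise ValueError("no non-overlapping box found")


def build_variants(image, evidence, rng):
    """Assumed construction of the positive, counterfactual, and random variants."""
    height, width = len(image), len(image[0])
    size = (evidence[2] - evidence[0], evidence[3] - evidence[1])
    distractor = random_box_outside(height, width, evidence, size, rng)
    return {
        "positive": [row[:] for row in image],          # evidence left visible
        "counterfactual": mask_box(image, evidence),    # evidence hidden
        "random": mask_box(image, distractor),          # irrelevant region hidden
    }


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1] * 8 for _ in range(8)]   # toy 8x8 "image"
    evidence = (2, 2, 4, 4)               # hypothetical evidence box
    variants = build_variants(image, evidence, rng)
    for name, grid in variants.items():
        print(name, sum(map(sum, grid)))
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Under these assumptions, a model that still answers confidently on the counterfactual variant is not relying on the evidence region, which is the kind of behavior a faithfulness reward could penalize; the random variant serves as a control where the answer should remain recoverable.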

CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

CV and Pattern Recognition

Helps computers imagine "what if" in videos.

25 Nov 2025 1

90%

ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

CV and Pattern Recognition

Finds fake pictures made by computers.

24 Sep 2025 2

90%

Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

CV and Pattern Recognition

Helps computers understand videos by asking questions.

12 Mar 2025 1


Country of Origin

🇨🇳 China

Repos / Data Links

github.com huggingface.co

Page Count

14 pages
