CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding
By: Yuefei Chen, Jiang Liu, Xiaodong Lin, and more
Potential Business Impact:
Helps computers imagine "what if" in videos.
Vision-Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning (inferring alternative outcomes under hypothetical conditions) remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.
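The abstract describes CFGPT only at a high level: distilling counterfactual reasoning from the language modality into the visual pathway. The paper's actual training objective is not given here, so the following is a minimal hypothetical sketch of one common way such cross-modal distillation is set up: minimize the KL divergence between a text-only "teacher" answer distribution and the VLM "student" distribution on the same counterfactual question. All names and the example distributions are illustrative assumptions, not the authors' method.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over answer options.

    In a distillation setup, p is the teacher's distribution and q the
    student's; minimizing this pulls the student toward the teacher.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical answer distributions over three multiple-choice options
# for one counterfactual question about a video.
teacher = [0.7, 0.2, 0.1]  # language-modality (text-only) reasoning
student = [0.4, 0.4, 0.2]  # vision-language model answering from video

# The distillation loss to minimize during post-training.
loss = kl_divergence(teacher, student)
```

A real implementation would compute these distributions from model logits (e.g. via softmax) and average the loss over a batch of counterfactual questions; this sketch only shows the core objective.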
Similar Papers
What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning
Computation and Language
Helps computers understand why things happen.
CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
Machine Learning (CS)
Helps computers understand cause and effect in pictures.
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models
Artificial Intelligence
Makes AI understand pictures and facts better.