
See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops

Published: August 25, 2025 | arXiv ID: 2508.17932v1

By: Zixuan Dong, Baoyun Peng, Yufei Wang, and more

Potential Business Impact:

Improves video question answering by focusing on query-relevant visual details.

Business Areas:

Computer Vision Hardware, Software

Human video comprehension demonstrates dynamic coordination between reasoning and visual attention, adaptively focusing on query-relevant details. However, current long-form video question answering systems employ rigid pipelines that decouple reasoning from perception, leading to either information loss through premature visual abstraction or computational inefficiency through exhaustive processing. The core limitation lies in the inability to adapt visual extraction to specific reasoning requirements: different queries demand fundamentally different visual evidence from the same video content. In this work, we present CAVIA, a training-free framework that revolutionizes video understanding through reasoning-perception coordination. Unlike conventional approaches where visual processing operates independently of reasoning, CAVIA creates a closed-loop system where reasoning continuously guides visual extraction based on identified information gaps. CAVIA introduces three innovations: (1) hierarchical reasoning-guided localization to precise frames; (2) cross-modal semantic bridging for targeted extraction; and (3) confidence-driven iterative synthesis. CAVIA achieves state-of-the-art performance on challenging benchmarks: EgoSchema (65.7%, +5.3%), NExT-QA (76.1%, +2.6%), and IntentQA (73.8%, +6.9%), demonstrating that dynamic reasoning-perception coordination provides a scalable paradigm for video understanding.
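The closed loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`reasoning_perception_loop`, `toy_reason`, `toy_localize`, `toy_extract`), the confidence threshold, the round limit, and the toy caption "video" are all assumptions made for illustration. In a real system the three callables would presumably wrap a language model and a vision model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class LoopResult:
    answer: str
    confidence: float
    rounds: int
    evidence: List[str]

def reasoning_perception_loop(
    query: str,
    num_frames: int,
    reason: Callable[[str, List[str]], Tuple[str, float, Optional[str]]],
    localize: Callable[[str, int], List[int]],
    extract: Callable[[List[int], str], List[str]],
    threshold: float = 0.8,  # hypothetical stopping confidence
    max_rounds: int = 3,     # hypothetical iteration budget (must be >= 1)
) -> LoopResult:
    """Alternate reasoning and targeted perception until confident."""
    evidence: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        # Reasoning step: answer from current evidence and name the information gap.
        answer, confidence, gap = reason(query, evidence)
        if confidence >= threshold or gap is None or round_idx == max_rounds:
            break
        # Perception step: localize frames for the gap, then extract only what is needed.
        frames = localize(gap, num_frames)
        evidence.extend(extract(frames, gap))
    return LoopResult(answer, confidence, round_idx, evidence)

# Toy stand-ins (assumptions): a "video" of three captioned frames.
TOY_VIDEO = {
    0: "a person enters the kitchen",
    1: "a red cup sits on the table",
    2: "the person leaves",
}

def toy_reason(query: str, evidence: List[str]) -> Tuple[str, float, Optional[str]]:
    for line in evidence:
        if "cup" in line:
            color = line.split(" cup")[0].split()[-1]  # word just before "cup"
            return color, 0.9, None
    return "unknown", 0.1, "cup"  # low confidence; the gap is the cup

def toy_localize(gap: str, num_frames: int) -> List[int]:
    return [i for i in range(num_frames) if gap in TOY_VIDEO[i]]

def toy_extract(frames: List[int], gap: str) -> List[str]:
    return [TOY_VIDEO[i] for i in frames]

if __name__ == "__main__":
    result = reasoning_perception_loop(
        "What color is the cup?", len(TOY_VIDEO), toy_reason, toy_localize, toy_extract
    )
    print(result.answer, result.rounds)  # prints: red 2
```

In this sketch the first round finds no evidence and reports a gap, the perception step fetches only the frame that matches the gap, and the second round answers with enough confidence to stop. Visual extraction is driven by what the reasoning step says it is missing, rather than by processing every frame up front.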

CAViAR: Critic-Augmented Video Agentic Reasoning

CV and Pattern Recognition

Lets computers understand long, tricky videos better.

9 Sep 2025 0

89%

Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

CV and Pattern Recognition

Helps computers understand videos like people do.

18 Nov 2025 0

89%

Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning

CV and Pattern Recognition

Helps AI "see" and "think" about pictures better.

27 Nov 2025 2


Repos / Data Links

github.com

Page Count

14 pages
