OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Published: December 29, 2025 | arXiv ID: 2512.23646v1

By: Keda Tao, Wenjie Du, Bohan Yu and more

Potential Business Impact:

Lets computers understand sounds and sights together better.

Business Areas:

Autonomous Vehicles, Transportation

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% to 20% in accuracy.
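The abstract gives no implementation details, so the following is only a rough illustration of what a coarse-to-fine, audio-guided selection step might look like, not the authors' method. It assumes a hypothetical list of per-second audio relevance scores (how they are produced is left open) and uses made-up helper names (`coarse_windows`, `fine_peak`, `frames_to_inspect`). The coarse pass keeps only windows whose mean score is high; the fine pass locates the strongest cue inside each kept window; only the seconds around that cue would then be passed to a visual tool.

```python
from typing import List, Tuple


def coarse_windows(scores: List[float], window: int, threshold: float) -> List[Tuple[int, int]]:
    """Coarse pass: return [start, end) windows whose mean audio score meets the threshold."""
    hits = []
    for start in range(0, len(scores), window):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) >= threshold:
            hits.append((start, start + len(chunk)))
    return hits


def fine_peak(scores: List[float], span: Tuple[int, int]) -> int:
    """Fine pass: return the index of the strongest audio cue inside one window."""
    start, end = span
    return max(range(start, end), key=lambda i: scores[i])


def frames_to_inspect(scores: List[float], window: int = 4,
                      threshold: float = 0.5, radius: int = 1) -> List[int]:
    """Pick the seconds of video worth sending to a (hypothetical) visual tool."""
    picked = set()
    for span in coarse_windows(scores, window, threshold):
        peak = fine_peak(scores, span)
        lo = max(0, peak - radius)
        hi = min(len(scores) - 1, peak + radius)
        picked.update(range(lo, hi + 1))
    return sorted(picked)


# Hypothetical per-second audio relevance scores for a 12-second clip.
audio_scores = [0.1, 0.0, 0.2, 0.1, 0.6, 0.9, 0.7, 0.4, 0.1, 0.1, 0.0, 0.2]
selected = frames_to_inspect(audio_scores)
print(selected)
```

In this toy setup only a small neighbourhood around the loudest cue is inspected visually, which contrasts with the dense frame-captioning that the abstract attributes to earlier work.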

Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Artificial Intelligence

Lets computers understand many things together, like pictures and words.

4 Nov 2025 1

92%

Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Artificial Intelligence

Lets computers understand all kinds of information together.

4 Nov 2025 1

90%

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

CV and Pattern Recognition

Lets computers understand and talk about videos.

15 Oct 2025 2

Page Count

11 pages
