ChronusOmni: Improving Time Awareness of Omni Large Language Models
By: Yijing Chen, Yihan Wu, Kaisi Guan and more
Potential Business Impact:
Helps computers understand videos by linking sound and sight.
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modally aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV, with more than a 30% improvement, and top results on most metrics of other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
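To make the interleaving idea concrete, below is a minimal sketch assuming one-second time units and a hypothetical `<t=Ns>` timestamp-token format; the paper's actual tokenization, time-unit granularity, and feature encoders are not specified in the abstract.

```python
# Minimal sketch (not the authors' code): interleave text-based timestamp
# tokens with per-time-unit visual and audio representations, as described
# in the abstract. The names `visual_feats`, `audio_feats`, and the
# "<t=Ns>" token format are illustrative assumptions.

def build_interleaved_sequence(visual_feats, audio_feats, unit_seconds=1):
    """visual_feats / audio_feats: lists of per-time-unit feature chunks
    (e.g., embedded frames and audio segments of equal duration)."""
    sequence = []
    for i, (v, a) in enumerate(zip(visual_feats, audio_feats)):
        # A text timestamp token marks the start of this time unit,
        # giving both modalities a shared, explicit temporal anchor.
        sequence.append(f"<t={i * unit_seconds}s>")
        sequence.append(v)  # visual representation for this unit
        sequence.append(a)  # audio representation for this unit
    return sequence

# Example: a 3-second clip split into 1-second units.
demo = build_interleaved_sequence(["V0", "V1", "V2"], ["A0", "A1", "A2"])
print(demo)
# ['<t=0s>', 'V0', 'A0', '<t=1s>', 'V1', 'A1', '<t=2s>', 'V2', 'A2']
```

In this reading, each timestamp token acts as a shared temporal index that both modalities attend to, which is what allows explicit questions ("when does X happen?") and implicit cross-modal questions ("what is said when X happens?") to be grounded in the same sequence.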
Similar Papers
TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models
Sound
Helps computers understand exact moments in audio.
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
CV and Pattern Recognition
Lets computers understand and talk about videos.
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
Computation and Language
Helps AI understand tiny details in sound and video.