QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models
By: Zixing Lin, Jiale Wang, Gee Wah Ng, and more
Potential Business Impact:
Lets computers understand long videos and sounds.
Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on shorter videos a few minutes in length. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for videos ranging from a few minutes to beyond an hour, opening up new potential applications in sensemaking, video content analysis, embodied AI, and more. Quantitative experiments with QMAVIS demonstrated a 38.75% improvement over state-of-the-art video-audio LMMs such as VideoLLaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets such as PerceptionTest and EgoSchema showed improvements of up to 2%, indicating competitive performance. Qualitative experiments also showed that QMAVIS is able to extract the nuances of different scenes in long video-audio content while understanding the overarching narrative. Ablation studies were also conducted to ascertain the impact of each component in the fusion pipeline.
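To make the late-fusion idea concrete, the sketch below shows one plausible way such a pipeline could be wired: per-segment visual captions from a video LMM and speech transcripts from an ASR model are merged into a single time-stamped context, which an LLM then reasons over. The `Segment` structure, function names, and prompt format are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a late-fusion long-video QA pipeline (illustrative only;
# not the QMAVIS implementation). Component outputs are assumed to already
# exist as per-segment captions and transcripts.

from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float
    end_s: float
    caption: str      # visual/audio description from a video-audio LMM (assumed)
    transcript: str   # speech transcript from an ASR model (assumed)

def fuse_segments(segments: list[Segment]) -> str:
    """Late fusion: merge per-segment model outputs into one textual context."""
    lines = []
    for seg in segments:
        lines.append(
            f"[{seg.start_s:.0f}-{seg.end_s:.0f}s] "
            f"Visual: {seg.caption} | Speech: {seg.transcript}"
        )
    return "\n".join(lines)

def answer_question(segments: list[Segment], question: str, llm) -> str:
    """An LLM reasons over the fused long-video context to answer a question."""
    context = fuse_segments(segments)
    prompt = (
        "You are given time-stamped visual captions and speech transcripts "
        "from a long video.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)

if __name__ == "__main__":
    # Dummy LLM callable used only to keep the example self-contained.
    segs = [
        Segment(0, 60, "A presenter stands by a whiteboard.", "Welcome to the lecture."),
        Segment(60, 120, "Slides show a fusion architecture diagram.", "We combine vision and speech."),
    ]
    print(answer_question(segs, "What is the lecture about?", llm=lambda p: "(LLM answer here)"))
```

One appeal of this late-fusion design is that each component (video LMM, ASR, LLM) can be swapped or scaled independently, and the fused textual context keeps long videos within the LLM's context budget.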
Similar Papers
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
Multimedia
Tests AI that understands videos and sounds together.
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Artificial Intelligence
Helps computers answer questions about long videos.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand who speaks in videos.