QCaption: Video Captioning and Q&A through Fusion of Large Multimodal Models
By: Jiale Wang, Gee Wah Ng, Lee Onn Mak, and more
This paper introduces QCaption, a novel video captioning and Q&A pipeline that enhances video analytics by fusing three components: a key frame extraction module, a Large Multimodal Model (LMM) for image-text analysis, and a Large Language Model (LLM) for text analysis. This approach enables integrated analysis of text, images, and video, achieving performance improvements over existing video captioning and Q&A models while remaining fully self-contained and well suited for on-premises deployment. Experimental results using QCaption demonstrated up to 44.2% and 48.9% improvements on video captioning and Q&A tasks, respectively. Ablation studies were also performed to assess the contribution of the LLM fusion stage to these results. Moreover, the paper proposes and evaluates additional video captioning approaches, benchmarking them against QCaption and existing methodologies. QCaption demonstrates the potential of adopting a model fusion approach to advance video analytics.
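To make the three-stage fusion concrete, below is a minimal Python sketch of how such a pipeline could be wired together: sample key frames, caption each frame with an LMM, then have an LLM fuse the frame captions into a video-level caption or answer. The function names (extract_key_frames, qcaption_pipeline), the uniform frame-sampling strategy, and the prompt format are illustrative assumptions, not QCaption's actual implementation; the lmm_caption and llm_fuse callables stand in for whichever LMM and LLM are deployed.

# Hypothetical sketch of a key-frame -> LMM -> LLM fusion pipeline (not QCaption's code).
from typing import Callable, List

def extract_key_frames(video_path: str, num_frames: int = 8) -> List[bytes]:
    """Stage 1: pick representative frames. Uniform sampling is used here as a
    placeholder for a dedicated key frame extraction model."""
    import cv2  # assumes OpenCV is available; any frame extractor would do
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, max(total, 1), max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.imencode(".jpg", frame)[1].tobytes())
    cap.release()
    return frames

def qcaption_pipeline(
    video_path: str,
    question: str,
    lmm_caption: Callable[[bytes], str],  # image -> text (any LMM)
    llm_fuse: Callable[[str], str],       # prompt -> text (any LLM)
) -> str:
    """Stage 2: caption each key frame with the LMM.
    Stage 3: pass all frame captions to the LLM, which fuses them to answer."""
    frames = extract_key_frames(video_path)
    frame_captions = [lmm_caption(f) for f in frames]
    prompt = (
        "Frame-by-frame descriptions of a video:\n"
        + "\n".join(f"- {c}" for c in frame_captions)
        + f"\n\nUsing only these descriptions, answer: {question}"
    )
    return llm_fuse(prompt)

In this sketch the LLM only ever sees text, which is what keeps the design self-contained and swappable: any locally hosted LMM and LLM can be plugged in for on-premises deployment.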
Similar Papers
QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models
Artificial Intelligence
Lets computers understand long videos and sounds.
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
CV and Pattern Recognition
Makes computers judge video quality better, faster.
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
CV and Pattern Recognition
Helps computers understand movies by watching and reading.