Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
By: Soumya Shamarao Jahagirdar, Jayasree Saha, C V Jawahar
Potential Business Impact:
Helps computers understand long videos without human annotators.
Learning multimodal video understanding typically relies on datasets of video clips paired with manually annotated captions. Annotation becomes even more challenging for long-form videos, lasting from minutes to hours, in educational and news domains, because it requires annotators with subject expertise; hence the need for automated solutions. Recent advances in Large Language Models (LLMs) promise to capture concise, informative content that enables comprehension of entire videos by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR): ASR provides textual content from the audio track, while OCR extracts textual content from selected frames. This paper introduces a dataset of long-form lecture and news videos. We present baseline approaches, analyze their limitations on this dataset, and advocate exploring prompt engineering techniques for comprehensive understanding of long-form multimodal videos.
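The pipeline the abstract describes, ASR over the audio track, OCR over sampled frames, and an LLM prompt assembled from both, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes `openai-whisper`, `pytesseract`, `opencv-python`, and `Pillow` are installed, and `query_llm` is a hypothetical stand-in for whichever LLM endpoint is used.

```python
# Minimal sketch of an ASR + OCR + LLM-prompting pipeline for long-form video.
# Not the paper's implementation; whisper, pytesseract, and opencv-python are
# assumed installed, and query_llm() is a hypothetical placeholder.
import cv2                 # frame extraction from the video file
import whisper             # openai-whisper for ASR
import pytesseract         # Tesseract wrapper for OCR
from PIL import Image

def transcribe_audio(video_path: str) -> str:
    """Run ASR over the video's audio track."""
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]

def ocr_frames(video_path: str, every_n_sec: int = 30) -> list[str]:
    """OCR one frame every `every_n_sec` seconds (e.g., lecture slides)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_sec))
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            text = pytesseract.image_to_string(Image.fromarray(rgb)).strip()
            if text:
                texts.append(text)
        idx += 1
    cap.release()
    return texts

def build_prompt(asr_text: str, ocr_texts: list[str]) -> str:
    """Combine both textual modalities into a single LLM prompt."""
    ocr_block = "\n---\n".join(ocr_texts)
    return (
        "You are given the speech transcript and on-screen text of a long video.\n"
        f"SPEECH TRANSCRIPT:\n{asr_text}\n\n"
        f"ON-SCREEN TEXT (sampled frames):\n{ocr_block}\n\n"
        "Summarize the video's main topics and key points."
    )

def query_llm(prompt: str) -> str:
    """Hypothetical hook: plug in an LLM provider of your choice here."""
    raise NotImplementedError

if __name__ == "__main__":
    video = "lecture.mp4"  # hypothetical input file
    prompt = build_prompt(transcribe_audio(video), ocr_frames(video))
    print(prompt[:500])  # pass `prompt` to query_llm() once a provider is wired in
```

For hour-long videos, the combined ASR and OCR text will often exceed an LLM's context window, which is one reason the paper argues for prompt engineering; a practical variant of this sketch would chunk the transcripts and summarize hierarchically.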
Similar Papers
Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
Artificial Intelligence
Lets you edit long videos easily with words.
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Information Retrieval
Helps video apps understand what you *really* like.
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
CV and Pattern Recognition
Lets computers understand long videos without seeing them all.