SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding
By: Nianbo Zeng , Haowen Hou , Fei Richard Yu and more
Potential Business Impact:
Helps computers understand long videos by breaking them into scenes.
Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
Similar Papers
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
CV and Pattern Recognition
Helps computers understand long videos better.
E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
CV and Pattern Recognition
Makes computers understand long videos faster and better.
A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models
Computation and Language
Helps computers understand complex topics better.