Unleashing Hour-Scale Video Training for Long Video-Language Understanding
By: Jingyang Lin, Jialian Wu, Ximeng Sun, and more
Potential Business Impact:
Lets computers understand hour-long videos.
Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations to up to 1 hour and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
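To make the memory augmentation idea concrete, below is a minimal sketch (not the paper's released code) of a question-conditioned retrieval step: compressed video tokens attend over a cached full-video memory, with the user question folded into the queries so that only question-relevant, spatiotemporally informative features are pulled back in. All module names, dimensions, and the specific fusion scheme here are illustrative assumptions.

```python
# Hypothetical sketch of a memory augmentation module, assuming:
#  - video_tokens:  compressed tokens actually fed to the LMM
#  - memory_tokens: cached dense 1-FPS full-video features (much longer)
#  - question_emb:  pooled embedding of the user question
import torch
import torch.nn as nn


class MemoryAugmentation(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Cross-attention: queries from compressed tokens, keys/values from memory.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q_proj = nn.Linear(2 * dim, dim)  # mix the question into the queries
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, memory_tokens, question_emb):
        # video_tokens:  (B, N, D), memory_tokens: (B, M, D) with M >> N,
        # question_emb:  (B, D)
        q_ctx = question_emb.unsqueeze(1).expand(-1, video_tokens.size(1), -1)
        queries = self.q_proj(torch.cat([video_tokens, q_ctx], dim=-1))
        retrieved, _ = self.cross_attn(queries, memory_tokens, memory_tokens)
        # Residually inject the retrieved question-relevant semantics.
        return self.norm(video_tokens + retrieved)


if __name__ == "__main__":
    B, N, M, D = 1, 256, 3600, 1024  # e.g. one memory token per second of an hour-long video
    module = MemoryAugmentation(dim=D)
    out = module(torch.randn(B, N, D), torch.randn(B, M, D), torch.randn(B, D))
    print(out.shape)  # torch.Size([1, 256, 1024])
```

The key design point illustrated here is that the long video context never enters the language model directly; it is kept as a cache and consulted through attention, which is what makes 1-FPS training and inference over hour-long videos tractable.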
Similar Papers
HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding
CV and Pattern Recognition
Helps computers understand long videos like movies.
Scaling RL to Long Videos
CV and Pattern Recognition
Lets computers understand long videos better.
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
CV and Pattern Recognition
Tests how well computers understand long videos.