SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
By: Hao Du, Bo Wu, Yan Lu, and more
Potential Business Impact:
Helps computers understand videos and words together.
Vision-language temporal alignment is a crucial capability for recognizing and understanding human dynamics in real-world scenarios. While existing research focuses on capturing vision-language relevance, it faces limitations due to biased temporal distributions, imprecise annotations, and insufficient compositionality. To enable fair evaluation and comprehensive exploration, we investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present a statistical analysis of existing benchmarks and reveal their challenges from a decomposed perspective. To this end, we introduce SVLTA, a Synthetic Vision-Language Temporal Alignment benchmark derived via a well-designed and feasible controllable generation method within a simulation environment. The approach leverages commonsense knowledge, manipulable actions, and constrained filtering to generate reasonable, diverse, and balanced data distributions for diagnostic evaluation. Our experiments reveal diagnostic insights through evaluations of temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation.
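To make the temporal-alignment setting concrete, the sketch below shows one common way such benchmarks score predictions: temporal IoU between a predicted video segment and a ground-truth segment, with recall at an IoU threshold. This is an illustrative assumption for grounding the idea, not SVLTA's published evaluation protocol; the function names and the 0.5 threshold are hypothetical.

```python
# Minimal sketch of temporal-IoU scoring for vision-language temporal alignment.
# Assumption: segments are (start_sec, end_sec) tuples; this is not SVLTA's
# official metric code, just a common formulation in temporal grounding work.

def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose predicted segment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Example: the first prediction overlaps well, the second misses entirely.
preds = [(2.0, 6.0), (10.0, 12.0)]
gts = [(2.5, 6.5), (20.0, 25.0)]
print(recall_at_iou(preds, gts, threshold=0.5))  # 0.5
```

A balanced benchmark matters for this kind of metric: if ground-truth segments cluster at certain positions or lengths, a model can score well by exploiting the prior rather than genuinely aligning language with video, which is the bias SVLTA's controlled generation aims to remove.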
Similar Papers
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
CV and Pattern Recognition
Helps computers talk about moving pictures instantly.
Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
CV and Pattern Recognition
Find bad things in videos using words.
DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment
CV and Pattern Recognition
Makes videos look better by understanding motion.