SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
By: Fida Mohammad Thoker, Letian Jiang, Chen Zhao and more
Potential Business Impact:
Computers learn to understand videos better.
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, which limits our understanding of how they generalize in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs perform better on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.