Video Understanding by Design: How Datasets Shape Architectures and Insights
By: Lei Wang, Piotr Koniusz, Yongsheng Gao
Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret architectural milestones, from two-stream and 3D CNNs to sequential models, transformers, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.