Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
By: Shihao Ji, Zihui Song
Potential Business Impact:
Makes computers understand videos without extra training.
The remarkable zero-shot reasoning capabilities of large-scale Vision-Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
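The following is a minimal sketch of the pipeline stages the abstract describes, not the authors' implementation. It assumes CLIP (openai/clip-vit-base-patch32) as a stand-in for the frozen VLM visual encoder, uses the `ruptures` library's kernel change-point detector as a substitute for Kernel Temporal Segmentation, and uses scikit-learn's DBSCAN for the density-based scene clustering; the final VLM captioning of keyframes is omitted. Function names, sampling stride, and clustering parameters are illustrative choices, not values from the paper.

```python
# Hypothetical sketch of a training-free video structuring pipeline:
# frame features from a frozen encoder -> kernel temporal segmentation ->
# density-based clustering of segments -> keyframe selection per segment.
import cv2
import numpy as np
import ruptures as rpt
import torch
from PIL import Image
from sklearn.cluster import DBSCAN
from transformers import CLIPModel, CLIPProcessor


def extract_frame_features(video_path, stride=15, device="cpu"):
    """Map a video to a trajectory of semantic features using a frozen CLIP encoder."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    cap, feats, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # subsample frames to keep the trajectory tractable
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = proc(images=img, return_tensors="pt").to(device)
            with torch.no_grad():
                f = model.get_image_features(**inputs)
            feats.append(torch.nn.functional.normalize(f, dim=-1).cpu().numpy()[0])
        idx += 1
    cap.release()
    return np.stack(feats)  # shape: (num_sampled_frames, feature_dim)


def segment_and_cluster(feats, n_segments=10, eps=0.15, min_samples=2):
    """Kernel change-point segmentation (KTS stand-in), then density-based clustering."""
    # Partition the feature trajectory into semantically coherent event segments.
    bkps = rpt.KernelCPD(kernel="rbf", min_size=2).fit(feats).predict(n_bkps=n_segments - 1)
    bounds = [0] + bkps  # ruptures returns segment end indices; the last equals len(feats)
    segments = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    # Represent each segment by its mean feature, then cluster recurring scenes/themes.
    seg_feats = np.stack([feats[s:e].mean(axis=0) for s, e in segments])
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(seg_feats)
    # Keyframe per segment: the sampled frame most similar to its segment centroid.
    keyframes = [s + int(np.argmax(feats[s:e] @ seg_feats[i])) for i, (s, e) in enumerate(segments)]
    return segments, labels, keyframes


if __name__ == "__main__":
    features = extract_frame_features("example.mp4")  # placeholder path
    segments, scene_labels, keyframes = segment_and_cluster(features)
    # Each keyframe index could then be captioned by the VLM's generative head
    # to assemble the structured, multi-modal summary described in the abstract.
    print(segments, scene_labels, keyframes)
```

In this sketch the segment count and DBSCAN radius are fixed by hand; a fully automatic variant would instead select the number of change points with a penalty term and tune `eps` from the pairwise distance distribution of segment features.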
Similar Papers
Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
CV and Pattern Recognition
Teaches computers to learn from videos without labels.
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
CV and Pattern Recognition
Helps computers understand videos better by reading descriptions.