Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
By: Shihao Ji, Zihui Song
Potential Business Impact:
Makes computers understand videos without extra training.
The remarkable zero-shot reasoning capabilities of large-scale Vision-Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
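The following is a minimal sketch of the pipeline stages the abstract describes, not the authors' implementation. It assumes CLIP (openai/clip-vit-base-patch32) as a stand-in for the frozen VLM visual encoder, uses the `ruptures` library's kernel change-point detector as a substitute for Kernel Temporal Segmentation, and uses scikit-learn's DBSCAN for the density-based scene clustering; the final VLM captioning of keyframes is omitted. Function names, sampling stride, and clustering parameters are illustrative choices, not values from the paper.

```python
# Hypothetical sketch of a training-free video structuring pipeline:
# frame features from a frozen encoder -> kernel temporal segmentation ->
# density-based clustering of segments -> keyframe selection per segment.
import cv2
import numpy as np
import ruptures as rpt
import torch
from PIL import Image
from sklearn.cluster import DBSCAN
from transformers import CLIPModel, CLIPProcessor


def extract_frame_features(video_path, stride=15, device="cpu"):
    """Map a video to a trajectory of semantic features using a frozen CLIP encoder."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    cap, feats, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # subsample frames to keep the trajectory tractable
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = proc(images=img, return_tensors="pt").to(device)
            with torch.no_grad():
                f = model.get_image_features(**inputs)
            feats.append(torch.nn.functional.normalize(f, dim=-1).cpu().numpy()[0])
        idx += 1
    cap.release()
    return np.stack(feats)  # shape: (num_sampled_frames, feature_dim)


def segment_and_cluster(feats, n_segments=10, eps=0.15, min_samples=2):
    """Kernel change-point segmentation (KTS stand-in), then density-based clustering."""
    # Partition the feature trajectory into semantically coherent event segments.
    bkps = rpt.KernelCPD(kernel="rbf", min_size=2).fit(feats).predict(n_bkps=n_segments - 1)
    bounds = [0] + bkps  # ruptures returns segment end indices; the last equals len(feats)
    segments = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    # Represent each segment by its mean feature, then cluster recurring scenes/themes.
    seg_feats = np.stack([feats[s:e].mean(axis=0) for s, e in segments])
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(seg_feats)
    # Keyframe per segment: the sampled frame most similar to its segment centroid.
    keyframes = [s + int(np.argmax(feats[s:e] @ seg_feats[i])) for i, (s, e) in enumerate(segments)]
    return segments, labels, keyframes


if __name__ == "__main__":
    features = extract_frame_features("example.mp4")  # placeholder path
    segments, scene_labels, keyframes = segment_and_cluster(features)
    # Each keyframe index could then be captioned by the VLM's generative head
    # to assemble the structured, multi-modal summary described in the abstract.
    print(segments, scene_labels, keyframes)
```

In this sketch the segment count and DBSCAN radius are fixed by hand; a fully automatic variant would instead select the number of change points with a penalty term and tune `eps` from the pairwise distance distribution of segment features.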
Similar Papers
Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
CV and Pattern Recognition
Teaches computers to learn from videos without labels.
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
CV and Pattern Recognition
Helps computers understand videos better by reading descriptions.