Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
By: Yehna Kim, Young-Eun Kim, and Seong-Whan Lee
Potential Business Impact:
Helps computers recognize actions in videos, including actions they were never explicitly trained on.
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, relying solely on action class names for semantic context poses a significant challenge: many class names contain words with multiple meanings, which introduces ambiguity about the intended action concept. To address this issue, we propose an approach that harnesses web-crawled descriptions and leverages a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual creation of attribute data. Additionally, we introduce a spatio-temporal interaction module that focuses on objects and action units, facilitating alignment between description attributes and video content. In zero-shot experiments, our model attains accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring its adaptability and effectiveness across downstream tasks.
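To make the zero-shot matching idea concrete, the sketch below shows the general recipe the abstract describes: each action class is represented by a text embedding built from its name plus LLM-extracted description keywords, a video is encoded into the same space, and the predicted class is the one with the highest cosine similarity. The `ToyTextEncoder` and `ToyVideoEncoder` modules here are hypothetical stand-ins; the paper's actual VLM backbone and spatio-temporal interaction module are not reproduced.

```python
# Minimal sketch of attribute-augmented zero-shot action recognition.
# Assumption: the encoders below are toy placeholders, not the authors' model.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512

class ToyTextEncoder(nn.Module):
    """Placeholder for a VLM text encoder applied to class names + attributes."""
    def __init__(self, vocab_size=1000, dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # averages token embeddings

    def forward(self, token_ids):          # token_ids: (num_classes, tokens_per_class)
        return self.embed(token_ids)        # -> (num_classes, dim)

class ToyVideoEncoder(nn.Module):
    """Placeholder for a spatio-temporal video encoder with temporal pooling."""
    def __init__(self, frame_dim=2048, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(frame_dim, dim)

    def forward(self, frames):              # frames: (num_frames, frame_dim)
        return self.proj(frames).mean(dim=0, keepdim=True)  # -> (1, dim)

def zero_shot_scores(video_emb, class_embs):
    """Cosine similarity between one video and every class embedding."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(class_embs, dim=-1)
    return v @ c.t()                        # -> (1, num_classes)

if __name__ == "__main__":
    torch.manual_seed(0)
    text_enc, video_enc = ToyTextEncoder(), ToyVideoEncoder()

    # Each class is its name plus LLM-extracted description keywords;
    # here, fake token ids stand in for the tokenized text.
    class_tokens = torch.randint(0, 1000, (3, 8))   # 3 unseen classes, 8 tokens each
    class_embs = text_enc(class_tokens)             # (3, 512)

    frames = torch.randn(16, 2048)                  # 16 frames of visual features
    video_emb = video_enc(frames)                   # (1, 512)

    scores = zero_shot_scores(video_emb, class_embs)
    print("predicted class index:", scores.argmax(dim=-1).item())
```

Because classification reduces to nearest-neighbor search in the shared embedding space, new action classes can be added at test time simply by encoding their names and description attributes, which is what makes the zero-shot setting possible.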
Similar Papers
Evaluation of Vision-LLMs in Surveillance Video
CV and Pattern Recognition
Helps computers spot unusual things in videos.
Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization
CV and Pattern Recognition
Helps computers understand what's happening in videos.