
Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization

Published: September 6, 2025 | arXiv ID: 2509.05695v1

By: Jingwei Peng, Zhixuan Qiu, Boyu Jin and more

Potential Business Impact:

Helps computers understand what's happening in videos.

Business Areas:

Image Recognition Data and Analytics, Software

Human action recognition often struggles with deep semantic understanding, complex contextual information, and fine-grained distinctions, limitations that traditional methods frequently encounter on diverse video data. Inspired by the capabilities of large language models, this paper introduces LVLM-VAR, a novel framework that applies pre-trained Vision-Language Large Models (LVLMs) to video action recognition, with an emphasis on accuracy and interpretability. Our method features a Video-to-Semantic-Tokens (VST) Module, which transforms raw video sequences into discrete, semantically and temporally consistent "semantic action tokens," forming an "action narrative" that an LVLM can process. These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for action classification and semantic reasoning. LVLM-VAR achieves state-of-the-art or highly competitive performance on challenging benchmarks such as NTU RGB+D and NTU RGB+D 120 (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), and it improves interpretability by generating natural language explanations for its predictions.
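
The abstract mentions LoRA fine-tuning without giving details. As a rough illustration of the general LoRA idea only (not the paper's implementation), the sketch below keeps a weight matrix W frozen and adds a scaled low-rank update, W + (alpha / r) * B A. All names, shapes, and values are hypothetical and chosen for readability; a real LVLM would apply this inside its attention or projection layers using a deep learning library.

```python
# Minimal LoRA sketch in pure Python (illustrative only; not the paper's code).
# Every name, shape, and value below is a hypothetical placeholder.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank.

    w: frozen pre-trained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)                 # adapter rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)       # low-rank update, shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen weight with d_out = 2, d_in = 3.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# Rank-1 adapter. B is conventionally initialised to zeros, so the
# adapted model starts out identical to the pre-trained one.
a = [[1.0, 2.0, 3.0]]
b_zero = [[0.0], [0.0]]
b_trained = [[0.5], [-1.0]]    # stand-in for values learned in fine-tuning

w_init = lora_effective_weight(w, a, b_zero, alpha=2.0)
w_tuned = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

With B at zero the effective weight equals the frozen W, which is why this style of fine-tuning starts from the pre-trained model's behaviour; only the small matrices A and B are trained.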

VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos

CV and Pattern Recognition

Helps computers understand actions in videos better.

21 Aug 2025

91%

Vision-Language Models Unlock Task-Centric Latent Actions

Machine Learning (CS)

Teaches robots to ignore distractions and learn better.

30 Jan 2026

91%

Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

CV and Pattern Recognition

Helps computers understand videos better by reading descriptions.

31 Oct 2025


Page Count

14 pages
