A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition
By: Xiuliang Zhang, Tadiwa Elisha Nyamasvisva, Chuntao Liu
Potential Business Impact:
Helps computers understand actions in videos better.
Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle to model long-range dependencies. Conversely, Transformers excel at learning global contextual information but incur high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, and a fusion mechanism integrates the two representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.
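To make the described architecture concrete, below is a minimal PyTorch sketch of such a hybrid: a 3D CNN front end for local spatiotemporal features, a Transformer encoder over the resulting per-frame features for long-range temporal context, and a simple concatenation-based fusion head. All module choices, layer sizes, and the fusion strategy here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class Hybrid3DCNNTransformer(nn.Module):
    """Illustrative hybrid (assumed design, not the paper's exact model):
    3D CNN for local spatiotemporal features, Transformer encoder for
    long-range temporal dependencies, concatenation-based fusion head."""

    def __init__(self, num_classes: int = 10, embed_dim: int = 256):
        super().__init__()
        # 3D CNN backbone: extracts low-level spatiotemporal features,
        # then pools away the spatial dimensions while keeping time.
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, embed_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep temporal axis
        )
        # Transformer encoder: models long-range temporal context
        # across the per-frame CNN features.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Fusion: concatenate temporally pooled local (CNN) and global
        # (Transformer) representations, then classify.
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        feats = self.cnn(x)                       # (B, D, T, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T, D)
        ctx = self.transformer(feats)             # (B, T, D)
        fused = torch.cat([feats.mean(dim=1), ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = Hybrid3DCNNTransformer(num_classes=10)
    clip = torch.randn(2, 3, 16, 112, 112)  # 2 clips of 16 RGB frames
    print(model(clip).shape)  # torch.Size([2, 10])
```

Pooling the spatial dimensions before the Transformer keeps the token sequence short (one token per frame), which reflects the abstract's point about managing the Transformer's computational cost; the paper's actual fusion mechanism may differ.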
Similar Papers
Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models
CV and Pattern Recognition
Predicts future video frames more accurately.
A Lightweight 3D-CNN for Event-Based Human Action Recognition with Privacy-Preserving Potential
CV and Pattern Recognition
Recognizes actions without showing faces.
Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics
CV and Pattern Recognition
Identifies people by how they move and stand.