Probabilistic Temporal Masked Attention for Cross-view Online Action Detection
By: Liping Xie, Yang Tan, Shicheng Jing, and more
Potential Business Impact:
Helps computers recognize actions in videos as they happen, even when the camera viewpoint changes.
As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. Mainstream OAD models are sensitive to changes in camera viewpoint, which often hampers their generalization to unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell that uses these representations to query the input video sequence, enhancing information interaction and enabling autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols, cross-subject (cs), cross-view (cv), and cross-subject-view (csv), show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.
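The abstract describes the TMA cell only at a high level, so the sketch below is a rough illustration rather than the authors' implementation: a PyTorch GRU cell whose hidden state acts as an attention query over the frames seen so far (a temporal mask hides future frames), combined with a Gaussian latent sampled via the reparameterization trick as a stand-in for the probabilistic compressed representation. The class name, layer dimensions, single-head dot-product attention, and Gaussian latent head are all assumptions made for demonstration.

```python
# Illustrative sketch only; PTMA's actual equations are not given in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalMaskedAttentionCell(nn.Module):
    """GRU cell whose hidden state queries past frame features under a causal mask (assumed design)."""

    def __init__(self, feat_dim: int, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.query = nn.Linear(hidden_dim, feat_dim)           # hidden state -> attention query
        self.gru = nn.GRUCell(feat_dim + latent_dim, hidden_dim)
        # Probabilistic head: mean and log-variance of a compressed latent per frame (assumption).
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, frames: torch.Tensor, h: torch.Tensor, t: int) -> torch.Tensor:
        """frames: (B, T, D) frame features; h: (B, H) hidden state; t: current time step."""
        q = self.query(h).unsqueeze(1)                          # (B, 1, D)
        scores = (q @ frames.transpose(1, 2)).squeeze(1)        # (B, T) similarity to every frame
        mask = torch.arange(frames.size(1), device=frames.device) > t
        scores = scores.masked_fill(mask, float("-inf"))        # temporal mask: ignore future frames
        attn = F.softmax(scores / frames.size(-1) ** 0.5, dim=-1)
        context = (attn.unsqueeze(1) @ frames).squeeze(1)       # (B, D) attended context

        # Latent compressed representation via the reparameterization trick (assumed Gaussian).
        mu, logvar = self.to_mu(context), self.to_logvar(context)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

        return self.gru(torch.cat([context, z], dim=-1), h)     # updated hidden state


# Autoregressive frame-level rollout over a toy clip (dimensions are arbitrary).
B, T, D, H, Z = 2, 16, 64, 128, 32
cell = TemporalMaskedAttentionCell(D, H, Z)
frames = torch.randn(B, T, D)
h = torch.zeros(B, H)
for t in range(T):
    h = cell(frames, h, t)   # a per-frame classification head would consume h here
```

The per-step loop mirrors the autoregressive frame-level analysis described in the abstract: at each time step the hidden state can only attend to frames up to the current one, so predictions are made online without access to future frames.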
Similar Papers
Reinforcement Learning meets Masked Video Modeling: Trajectory-Guided Adaptive Token Selection
CV and Pattern Recognition
Teaches computers to understand video motion better.
CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework
CV and Pattern Recognition
Teaches computers to see faster and better.
Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding
CV and Pattern Recognition
Helps computers understand what people will do next.