Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
By: Nicholas Babey, Tiffany Gu, Yiheng Li, and more
Potential Business Impact:
Teaches robots to understand actions by watching.
For embodied agents to effectively understand and interact with the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, which often rely on RGB video, learn superficial correlations between visual patterns and action labels, so they struggle to capture the underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2's contextual, predictive world dynamics and CoMotion's explicit, occlusion-tolerant human pose data. Our model is validated on the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. It outperforms three baselines, especially in complex, heavily occluded scenes. Our findings emphasize that action recognition should be supported by spatial understanding rather than statistical pattern recognition.
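The abstract does not detail how the two representations are combined. As a rough, non-authoritative sketch, the PyTorch snippet below illustrates one plausible late-fusion design: a pooled V-JEPA 2-style clip embedding and a pooled CoMotion-style pose embedding are projected into a shared space, concatenated, and classified. The class FusedActionClassifier, all dimensions (video_dim, pose_dim, hidden_dim, num_classes), and the fusion strategy itself are illustrative assumptions, not the authors' specification.

import torch
import torch.nn as nn

class FusedActionClassifier(nn.Module):
    """Hypothetical late-fusion head: combines a V-JEPA 2-style video
    embedding with a CoMotion-style 3D pose embedding to predict action
    labels. Dimensions and fusion strategy are illustrative assumptions."""

    def __init__(self, video_dim=1024, pose_dim=256, hidden_dim=512, num_classes=14):
        super().__init__()
        # Project each modality into a shared space before fusing.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.pose_proj = nn.Linear(pose_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * hidden_dim),
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, video_emb, pose_emb):
        # video_emb: (B, video_dim) pooled clip embedding from the video backbone.
        # pose_emb:  (B, pose_dim) pooled embedding of per-frame 3D joint tracks.
        fused = torch.cat(
            [self.video_proj(video_emb), self.pose_proj(pose_emb)], dim=-1
        )
        return self.classifier(fused)  # (B, num_classes) action logits

# Example with random stand-in features for a batch of 2 clips.
model = FusedActionClassifier()
logits = model(torch.randn(2, 1024), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 14])

Concatenation is only one option here; cross-attention between pose tokens and video tokens would be an equally plausible fusion mechanism under the same assumptions.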
Similar Papers
Simple 3D Pose Features Support Human and Machine Social Scene Understanding
CV and Pattern Recognition
Helps computers understand how people interact.
Biomechanically consistent real-time action recognition for human-robot interaction
Robotics
Helps robots understand what people are doing.
Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach
CV and Pattern Recognition
Helps computers see people using objects.