KRAST: Knowledge-Augmented Robotic Action Recognition with Structured Text for Vision-Language Models
By: Son Hai Nguyen, Diwei Wang, Jinhyeok Jang, and more
Potential Business Impact:
Helps robots see and understand what people do.
Accurate vision-based action recognition is crucial for developing autonomous robots that can operate safely and reliably in complex, real-world environments. In this work, we advance video-based recognition of indoor daily actions for robotic perception by leveraging vision-language models (VLMs) enriched with domain-specific knowledge. We adapt a prompt-learning framework in which class-level textual descriptions of each action are embedded as learnable prompts into a frozen pre-trained VLM backbone. Several strategies for structuring and encoding these textual descriptions are designed and evaluated. Experiments on the ETRI-Activity3D dataset demonstrate that our method, using only RGB video inputs at test time, achieves over 95% accuracy and outperforms state-of-the-art approaches. These results highlight the effectiveness of knowledge-augmented prompts in enabling robust action recognition with minimal supervision.
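To make the prompt-learning idea concrete, below is a minimal, self-contained sketch of how class-level textual knowledge can be injected as learnable prompts over a frozen dual-encoder VLM. It is written under assumptions, not taken from the paper: the encoder stubs, the `PromptLearner`/`KnowledgeAugmentedClassifier` names, the context-vector count, and the way description embeddings are concatenated with learnable context are all illustrative, in the spirit of CoOp-style prompt tuning rather than the authors' exact architecture.

```python
# Hedged sketch: prompt learning with structured class descriptions over a
# frozen CLIP-style video/text dual encoder. Only the prompt context vectors
# (and the logit scale) are trainable; both encoders stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    """Learnable context vectors prepended to frozen class-description embeddings."""

    def __init__(self, desc_embeds: torch.Tensor, num_ctx: int = 8):
        super().__init__()
        num_classes, desc_len, dim = desc_embeds.shape
        # Shared learnable prompt context for all action classes (assumption).
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)
        # Frozen embeddings of the structured, knowledge-augmented descriptions.
        self.register_buffer("desc_embeds", desc_embeds)

    def forward(self) -> torch.Tensor:
        n = self.desc_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)        # (C, num_ctx, D)
        return torch.cat([ctx, self.desc_embeds], dim=1)     # (C, num_ctx + L, D)


class KnowledgeAugmentedClassifier(nn.Module):
    """Frozen video and text encoders; classification by cosine similarity."""

    def __init__(self, video_encoder: nn.Module, text_encoder: nn.Module,
                 desc_embeds: torch.Tensor, num_ctx: int = 8):
        super().__init__()
        self.video_encoder = video_encoder.eval()
        self.text_encoder = text_encoder.eval()
        for p in list(video_encoder.parameters()) + list(text_encoder.parameters()):
            p.requires_grad = False                          # keep the VLM backbone frozen
        self.prompts = PromptLearner(desc_embeds, num_ctx)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))   # exp() ~= 100, CLIP-style

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.video_encoder(video), dim=-1)               # (B, D)
        t = F.normalize(self.text_encoder(self.prompts()), dim=-1)       # (C, D)
        return self.logit_scale.exp() * v @ t.t()                        # (B, C) logits


if __name__ == "__main__":
    # Toy stand-in encoders; a real setup would plug in a frozen VLM backbone.
    C, L, D = 55, 12, 512                      # ETRI-Activity3D defines 55 action classes
    video_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, D))
    text_enc = nn.Sequential(nn.Flatten(1), nn.Linear((8 + L) * D, D))
    model = KnowledgeAugmentedClassifier(video_enc, text_enc,
                                         torch.randn(C, L, D), num_ctx=8)
    logits = model(torch.randn(2, 3, 8, 32, 32))   # -> shape (2, 55)
    print(logits.shape)
```

In this kind of setup, only the small prompt-context tensor is optimized with a standard cross-entropy loss over the logits, which is what keeps the supervision requirement minimal while the pre-trained backbone supplies the visual-language alignment.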
Similar Papers
LaVA-Man: Learning Visual Action Representations for Robot Manipulation
Robotics
Robots learn to grab things by looking and reading.
Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization
CV and Pattern Recognition
Helps computers understand what's happening in videos.
Continually Evolving Skill Knowledge in Vision Language Action Model
Robotics
Robots learn new skills without constant retraining.