RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics
By: Zhiyuan Zhang, Yuxin He, Yong Sun, and more
Potential Business Impact:
Teaches robots to do tasks by watching videos.
Visual Language Models (VLMs) have emerged as pivotal tools for robotic systems, enabling cross-task generalization, dynamic environmental interaction, and long-horizon planning through multimodal perception and semantic reasoning. However, existing open-source VLMs, predominantly trained for generic vision-language alignment tasks, fail to effectively model the temporally correlated action semantics that are crucial for robotic manipulation. While current image-based fine-tuning methods partially adapt VLMs to robotic applications, they fundamentally disregard temporal evolution patterns in video sequences and suffer from visual feature entanglement between robotic agents, manipulated objects, and environmental contexts, which limits their ability to semantically decouple atomic actions and compromises model generalizability.

To overcome these challenges, this work presents RoboAct-CLIP with two technical contributions: 1) a dataset reconstruction framework that performs semantic-constrained action unit segmentation and re-annotation on open-source robotic videos, constructing purified training sets containing singular atomic actions (e.g., "grasp"); and 2) a temporal-decoupling fine-tuning strategy based on the Contrastive Language-Image Pretraining (CLIP) architecture, which disentangles temporal action features across video frames from object-centric characteristics to achieve hierarchical representation learning of robotic atomic actions.

Experimental results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline VLMs, along with superior generalization in multi-object manipulation tasks.
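To make the temporal-decoupling idea concrete, the sketch below shows one plausible way to add a small head on top of frozen per-frame CLIP image features: a frame-averaged, object-centric branch and a recurrent temporal (action) branch, with the action branch aligned to atomic-action text through a CLIP-style contrastive loss. This is a minimal illustration, not the authors' implementation; the module names, the GRU temporal encoder, and the random stand-in features are all assumptions.

```python
# Illustrative sketch only: a temporal-decoupling head over per-frame CLIP features.
# In practice, frame_feats would come from a (frozen) CLIP image encoder and
# action_text_emb from the CLIP text encoder applied to labels such as "grasp".
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDecouplingHead(nn.Module):
    """Splits per-frame features into an object-centric branch (time-averaged)
    and a temporal action branch (dynamics across the frame sequence)."""
    def __init__(self, dim=512):
        super().__init__()
        self.object_proj = nn.Linear(dim, dim)                        # static, object-centric features
        self.temporal_encoder = nn.GRU(dim, dim, batch_first=True)    # captures frame-order dynamics
        self.action_proj = nn.Linear(dim, dim)                        # temporal action features

    def forward(self, frame_feats):                                   # frame_feats: (B, T, D)
        obj = self.object_proj(frame_feats.mean(dim=1))               # pool over time -> object branch
        _, h = self.temporal_encoder(frame_feats)                     # last hidden state summarizes motion
        act = self.action_proj(h.squeeze(0))                          # action branch
        return F.normalize(obj, dim=-1), F.normalize(act, dim=-1)

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, action-text) pairs."""
    logits = video_emb @ text_emb.t() / temperature
    labels = torch.arange(video_emb.size(0), device=video_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy usage with random stand-ins for the CLIP frame and text embeddings.
B, T, D = 4, 8, 512
frame_feats = torch.randn(B, T, D)                                    # placeholder CLIP frame features
action_text_emb = F.normalize(torch.randn(B, D), dim=-1)              # placeholder action-text embeddings
head = TemporalDecouplingHead(D)
_, action_emb = head(frame_feats)
loss = clip_style_contrastive_loss(action_emb, action_text_emb)
```

The point of the separation is that only the temporal branch is pushed toward the action label, so object appearance and background context are less likely to leak into the learned action representation.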
Similar Papers
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
CV and Pattern Recognition
Helps computers understand pictures with many things.
Generalist Robot Manipulation beyond Action Labeled Data
Robotics
Robots learn new tasks from watching videos.
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Robotics
Teaches robots to do tasks by watching people.