Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
By: Zijian Song, Sihan Qin, Tianshui Chen, and more
Potential Business Impact:
Teaches robots to move by watching videos.
The scarcity of manipulation data has motivated the use of large models pretrained on other modalities in robotics. In this work, we build on autoregressive video generation models to propose a Physical Autoregressive Model (PAR), in which physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer that models frames and actions as continuous tokens, mitigating quantization error and letting frame and action prediction reinforce each other. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and a KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on the other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings point to a promising direction for robotic manipulation: transferring world knowledge from autoregressive video pretraining. Project page: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/
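To make the "physical token" idea concrete, here is a minimal PyTorch sketch, assuming hypothetical names (PhysicalARSketch, frame_in, action_in) and toy dimensions not taken from the paper: frame and action embeddings are interleaved into a single causally masked sequence and decoded autoregressively. The actual PAR uses a DiT-based de-tokenizer and a KV-cache for efficient rollout; linear heads and full-sequence re-encoding stand in here purely for brevity.

```python
# Illustrative sketch only; module names, dimensions, and heads are assumptions.
import torch
import torch.nn as nn

class PhysicalARSketch(nn.Module):
    def __init__(self, frame_dim=512, action_dim=7, d_model=256, n_layers=4):
        super().__init__()
        # Continuous "tokenizers": project frame features and actions into a shared space.
        self.frame_in = nn.Linear(frame_dim, d_model)
        self.action_in = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Continuous de-tokenizers (PAR uses a DiT; linear heads stand in here).
        self.frame_out = nn.Linear(d_model, frame_dim)
        self.action_out = nn.Linear(d_model, action_dim)

    def forward(self, frames, actions):
        # frames: (B, T, frame_dim), actions: (B, T, action_dim)
        B, T, _ = frames.shape
        # Interleave into one physical-token sequence: f_1, a_1, f_2, a_2, ...
        tokens = torch.stack([self.frame_in(frames), self.action_in(actions)], dim=2)
        tokens = tokens.reshape(B, 2 * T, -1)
        # Causal mask so each token only attends to past frames/actions.
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
        h = self.backbone(tokens, mask=mask)
        # Frame positions predict the current action; action positions predict the next frame.
        pred_actions = self.action_out(h[:, 0::2])
        pred_frames = self.frame_out(h[:, 1::2])
        return pred_frames, pred_actions

# Toy usage: one-step-ahead prediction from a short frame/action history.
model = PhysicalARSketch()
frames = torch.randn(1, 5, 512)
actions = torch.randn(1, 5, 7)
pred_frames, pred_actions = model(frames, actions)
print(pred_frames.shape, pred_actions.shape)  # (1, 5, 512), (1, 5, 7)
```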
Similar Papers
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Robotics
Teaches robots to do tasks by watching people.
ViPRA: Video Prediction for Robot Actions
Robotics
Teaches robots to move by watching videos.