Motus: A Unified Latent Action World Model
By: Hongzhe Bi, Hengkai Tan, Shenghao Xie, and more
Potential Business Impact:
Teaches robots to learn and do many tasks.
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions, extracting pixel-level "delta actions", and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid to enable large-scale action pretraining. Experiments show that Motus outperforms state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
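To make the abstract's two core mechanisms concrete, below is a minimal, hypothetical PyTorch sketch: a Mixture-of-Transformer block in which the understanding, video, and action token streams share one attention pass but keep separate per-modality feed-forward experts, and a UniDiffuser-style timestep scheduler in which per-modality noise levels select the modeling mode (t = 0 marks a modality as a clean condition; t > 0 marks it as being denoised). All class names, dimensions, and mode labels here are illustrative assumptions, not taken from the paper's code.

```python
# Hypothetical sketch of the two ideas named in the abstract; names,
# sizes, and mode labels are assumptions for illustration only.
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    """One Mixture-of-Transformer block: joint attention over all tokens,
    but a separate feed-forward expert per modality."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One expert MLP per modality; weights are NOT shared across experts.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
            for name in ("understanding", "video", "action")
        })

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Concatenate all modality streams so attention is shared ...
        names = list(tokens)
        lens = [tokens[n].shape[1] for n in names]
        x = torch.cat([tokens[n] for n in names], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # ... then route each stream back through its own expert MLP.
        out, offset = {}, 0
        for n, length in zip(names, lens):
            seg = x[:, offset:offset + length]
            out[n] = seg + self.experts[n](self.norm2(seg))
            offset += length
        return out


def sample_timesteps(mode: str, batch: int, T: int = 1000) -> dict[str, torch.Tensor]:
    """UniDiffuser-style trick: each modality gets its own diffusion timestep.
    t = 0 means "clean, given as a condition"; t > 0 means "being denoised".
    Choosing the per-modality timesteps selects the modeling mode."""
    rand = lambda: torch.randint(1, T, (batch,))
    zero = lambda: torch.zeros(batch, dtype=torch.long)
    if mode == "world_model":        # clean action -> denoise video
        return {"video": rand(), "action": zero()}
    if mode == "vla":                # clean observation frames -> denoise action
        return {"video": zero(), "action": rand()}
    if mode == "inverse_dynamics":   # same timestep pattern as "vla"; differs
        # only in which video frames are supplied (past AND future).
        return {"video": zero(), "action": rand()}
    if mode == "video_generation":   # action held at pure noise (t = T), so
        # video is generated without action conditioning.
        return {"video": rand(), "action": torch.full((batch,), T)}
    if mode == "joint":              # denoise video and action together
        t = rand()
        return {"video": t, "action": t.clone()}
    raise ValueError(f"unknown mode: {mode}")
```

A single set of expert weights then serves every mode: at each training step one samples a mode, calls something like sample_timesteps(mode, batch), noises each modality to its own level, and runs the shared MoT blocks, which is one plausible reading of how the abstract's five modeling modes coexist in one model.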
Similar Papers
LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models
Robotics
Teaches robots to do new jobs with little practice.
Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
Robotics
Robot learns to do tasks by watching and thinking.
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Robotics
Teaches robots by watching videos, not just experts.