Co-Evolving Latent Action World Models
By: Yucen Wang, Fengming Zhang, De-Chuan Zhan, and more
Potential Business Impact:
Makes AI simulate and control virtual worlds better.
Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but this is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with those of the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
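The training schedule the abstract describes, a warm-up phase followed by joint co-evolution, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' code: the module names (`LatentActionEncoder`, `WorldModel`, `sample_pair`), shapes, losses, and step counts are hypothetical, and the real CoLA-World builds on a pre-trained video generation model rather than small MLPs. The sketch only shows the schedule: during warm-up, gradients flow through the frozen pre-trained world model to train the from-scratch LAM; after warm-up, both models update jointly.

```python
# Minimal sketch of a warm-up-then-joint-training schedule, assuming
# hypothetical MLP stand-ins for the LAM and the world model.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """From-scratch LAM: infers a latent action from a consecutive frame pair."""
    def __init__(self, frame_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

class WorldModel(nn.Module):
    """Stand-in for the pre-trained world model: predicts the next frame
    from the current frame and a latent action."""
    def __init__(self, frame_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame_t, action):
        return self.net(torch.cat([frame_t, action], dim=-1))

def sample_pair(frames):
    """Pick a random consecutive frame pair from a (T, frame_dim) tensor."""
    t = torch.randint(0, frames.shape[0] - 1, (1,)).item()
    return frames[t], frames[t + 1]

def train(frames, warmup_steps=1_000, total_steps=10_000):
    lam = LatentActionEncoder()
    wm = WorldModel()  # in practice, loaded from a pre-trained checkpoint
    opt_lam = torch.optim.Adam(lam.parameters(), lr=1e-4)
    opt_wm = torch.optim.Adam(wm.parameters(), lr=1e-5)
    for step in range(total_steps):
        f_t, f_t1 = sample_pair(frames)
        action = lam(f_t, f_t1)                   # infer latent action
        pred = wm(f_t, action)                    # reconstruct the next frame
        loss = nn.functional.mse_loss(pred, f_t1)
        opt_lam.zero_grad()
        opt_wm.zero_grad()
        loss.backward()                           # gradients flow through wm into lam
        opt_lam.step()                            # the LAM always updates
        if step >= warmup_steps:
            opt_wm.step()                         # after warm-up, co-evolve jointly
```

Holding the world model's weights fixed during warm-up, while still letting its gradients flow into the LAM, is one plausible reading of the alignment phase the abstract describes: it forces the new LAM to adapt to the pre-trained representation rather than dragging both models toward collapse.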
Similar Papers
AdaWorld: Learning Adaptable World Models with Latent Actions
Artificial Intelligence
Teaches robots to learn new actions quickly.
Latent Action World Models for Control with Unlabeled Trajectories
Machine Learning (CS)
Teaches robots to learn from watching and doing.
Latent Action Pretraining Through World Modeling
Robotics
Teaches robots to do tasks from watching videos.