Planning with Reasoning using Vision Language World Model
By: Delong Chen, Theo Moutakanni, Willy Chung, and more
Potential Business Impact:
Helps robots understand and plan actions in videos.
Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented as a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive System-1 plan decoding and reflective System-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where System-2 planning improves the Elo score by +27% over System-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
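To make the System-2 procedure concrete, below is a minimal sketch of planning by cost minimization as described in the abstract: roll out candidate plans with the world model, score each predicted end state against the goal with the critic, and keep the lowest-cost plan. The interfaces `vlwm.propose_plans`, `vlwm.rollout`, and `critic.cost` are hypothetical placeholders, not the authors' actual API.

```python
def system2_plan(vlwm, critic, observation, goal_state, num_candidates=8):
    """Select the candidate plan whose predicted future state is
    semantically closest to the goal, as judged by the critic.

    Hypothetical interfaces (assumptions, not the paper's API):
      vlwm.propose_plans(obs, n) -> list of candidate action plans
      vlwm.rollout(obs, plan)    -> predicted future world state (text)
      critic.cost(state, goal)   -> semantic distance (lower is better)
    """
    # System-1: reactively decode several candidate action plans.
    candidates = vlwm.propose_plans(observation, n=num_candidates)

    best_plan, best_cost = None, float("inf")
    for plan in candidates:
        # Roll out the dynamics model to obtain the hypothetical
        # future world state reached by following this plan.
        predicted_state = vlwm.rollout(observation, plan)

        # The critic measures the semantic distance between the
        # predicted state and the expected goal state.
        cost = critic.cost(predicted_state, goal_state)
        if cost < best_cost:
            best_plan, best_cost = plan, cost

    return best_plan, best_cost
```

In this reading, System-1 corresponds to taking the first decoded plan directly, while System-2 spends extra compute evaluating alternatives and picking the one the critic deems closest to the goal.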
Similar Papers
WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation
CV and Pattern Recognition
Helps robots find things without bumping into them.
Latent Action Pretraining Through World Modeling
Robotics
Teaches robots to do tasks from watching videos.