Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model
By: Junshu Tang, Jiacheng Liu, Jiaqi Li, and more
Potential Business Impact:
Makes game worlds react to your natural-language commands.
Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control generated game video content through natural language prompts, keyboard input, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated pipeline to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We also introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse, free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".
Similar Papers
MagicWorld: Interactive Geometry-driven Video World Exploration
CV and Pattern Recognition
Creates stable, evolving worlds from your words.
Yan: Foundational Interactive Video Generation
CV and Pattern Recognition
Creates videos you can change with words.
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
CV and Pattern Recognition
Makes videos that change instantly with your actions.