Pixels to Play: A Foundation Model for 3D Gameplay
By: Yuguang Yue, Chris Green, Samuel Hunt, and more
Potential Business Impact:
Lets computers play many video games the way people do.
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases (AI teammates, controllable NPCs, personalized live-streamers, assistive testers), we argue that an agent must rely on the same pixel stream available to human players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human gameplay are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with autoregressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, present ablations on the value of unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
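The abstract names two concrete mechanisms: an inverse-dynamics model (IDM) that imputes actions for unlabeled public video, and a decoder-only transformer policy trained by behavior cloning on both human and imputed labels. Below is a minimal PyTorch sketch of that pipeline, under stated assumptions: `FrameEncoder`, `InverseDynamicsModel`, `P2PPolicy`, `impute_actions`, `bc_loss`, and all dimensions (`FRAME_DIM`, `NUM_ACTIONS`, frame shapes) are illustrative inventions, not the authors' code, and where P2P0.1 decodes each action autoregressively as a sequence of sub-tokens to cover a large action space, this sketch simplifies to a single discrete action per frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAME_DIM = 256     # assumed pixel-embedding width
NUM_ACTIONS = 512   # assumed size of a flattened, discretized action vocabulary


class FrameEncoder(nn.Module):
    """Toy CNN that maps one RGB frame to a single embedding vector."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, FRAME_DIM),
        )

    def forward(self, frames):          # frames: (B, 3, H, W)
        return self.net(frames)         # (B, FRAME_DIM)


class InverseDynamicsModel(nn.Module):
    """Predicts the action taken between two consecutive frames; used to
    impute pseudo-labels for unlabeled public gameplay videos."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(2 * FRAME_DIM, NUM_ACTIONS)

    def forward(self, frame_t, frame_t1):
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=-1)
        return self.head(z)             # (B, NUM_ACTIONS) action logits


class P2PPolicy(nn.Module):
    """Decoder-only transformer over per-frame embeddings: causal
    self-attention followed by a per-step action head."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        layer = nn.TransformerEncoderLayer(
            d_model=FRAME_DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(FRAME_DIM, NUM_ACTIONS)

    def forward(self, frame_seq):       # frame_seq: (B, T, 3, H, W)
        b, t = frame_seq.shape[:2]
        z = self.encoder(frame_seq.flatten(0, 1)).view(b, t, FRAME_DIM)
        causal = torch.triu(            # forbid attention to future frames
            torch.full((t, t), float("-inf"), device=z.device), diagonal=1)
        h = self.blocks(z, mask=causal)
        return self.action_head(h)      # (B, T, NUM_ACTIONS) logits


def impute_actions(idm, clip):
    """Pseudo-label a (B, T, 3, H, W) clip: one action per frame transition."""
    b, t = clip.shape[:2]
    with torch.no_grad():
        logits = idm(clip[:, :-1].flatten(0, 1), clip[:, 1:].flatten(0, 1))
    return logits.argmax(-1).view(b, t - 1)        # (B, T-1) action ids


def bc_loss(policy, clip, actions):
    """Behavior cloning: cross-entropy between the policy's per-step action
    logits and the (human or IDM-imputed) action labels."""
    logits = policy(clip[:, :-1])                  # predict each transition
    return F.cross_entropy(logits.reshape(-1, NUM_ACTIONS), actions.reshape(-1))


# Usage sketch: pseudo-label an unlabeled clip, then train the policy on it.
idm = InverseDynamicsModel(FrameEncoder())
policy = P2PPolicy(FrameEncoder())
clip = torch.randn(2, 8, 3, 64, 64)               # fake unlabeled video clip
loss = bc_loss(policy, clip, impute_actions(idm, clip))
loss.backward()
```

The causal mask is what makes this decoder-only in the GPT sense, and it plausibly underlies the latency claim: at inference time each new frame needs only one forward step over the accumulated context, which fits on a single consumer GPU.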
Similar Papers
Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
CV and Pattern Recognition
Makes virtual objects bend and break realistically.
World Simulation with Video Foundation Models for Physical AI
CV and Pattern Recognition
Creates realistic worlds from text, images, or video.
Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
CV and Pattern Recognition
Makes computers draw clearer, faster pictures.