UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
By: Zhaolong Su, Wang Lu, Hao Chen and more
Potential Business Impact:
Makes AI better at understanding and creating things.
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets this inconsistency. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek out and challenge fragile understanding, turning the model into its own adversary. Experiments demonstrate that UniGame significantly improves consistency (+4.6%). It also achieves substantial improvements in understanding (+3.6%), generation (+0.02), and out-of-distribution and adversarial robustness (+4.8% on NaturalBench and +6.2% on AdVQA, respectively). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
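To make the self-adversarial idea concrete, here is a toy sketch of a minimax loop in which a perturber shifts an embedding to maximize an understanding loss while the understanding head is trained to minimize it. This is not the UniGame implementation: the linear head, the closed-form L2-bounded perturber, and the names `perturb`, `self_play`, `eps`, and `lr` are all assumptions chosen for illustration, and the abstract does not specify the actual perturber or objective.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, x, y):
    """Binary cross-entropy of a linear head w on embedding x, label y in {0, 1}."""
    p = min(max(sigmoid(dot(w, x)), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def perturb(w, x, y, eps):
    """Hypothetical perturber: worst-case shift of x within an L2 ball of radius eps.

    For a linear head the loss-maximizing direction is along +/- w, so the
    inner maximization has a closed form (an assumption of this toy setup).
    """
    norm = math.sqrt(dot(w, w))
    if norm == 0.0:
        return list(x)
    sign = -1.0 if y == 1 else 1.0  # push the logit away from the true label
    return [xi + sign * eps * wi / norm for xi, wi in zip(x, w)]

def self_play(data, dim, eps=0.5, lr=0.1, epochs=50, seed=0):
    """Alternate adversary (maximize loss) and learner (minimize loss) steps."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.01) for _ in range(dim)]
    for _ in range(epochs):
        for x, y in data:
            x_adv = perturb(w, x, y, eps)       # adversary step
            g = sigmoid(dot(w, x_adv)) - y      # d(loss)/d(logit)
            w = [wi - lr * g * xi for wi, xi in zip(w, x_adv)]  # learner step
    return w
```

Calling `self_play(data, dim)` on a list of `(embedding, label)` pairs returns head weights trained against the worst-case perturbation at each step. In a real UMM the perturber would presumably be a small learned module acting on shared tokens rather than a closed-form shift.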
Similar Papers
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
CV and Pattern Recognition
Tests how well AI can see and create.
90% Faster, 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development
Artificial Intelligence
Builds 3D games from just your words.
UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning
CV and Pattern Recognition
Creates and changes pictures from words.