Bridging Your Imagination with Audio-Video Generation via a Unified Director
By: Jiaxu Zhang, Tianshu Hu, Yuan Zhang, and more
Potential Business Impact:
Makes AI create movies from your ideas.
Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that turns user prompts into well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ a Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which uses interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.