Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
By: Hong Zhang, Zhongjie Duan, Xingjun Wang, et al.
Potential Business Impact:
Makes computers understand, create, and change pictures.
Unified multimodal generative models aim to integrate image understanding and generation abilities, offering significant advantages in harnessing multimodal corpora, particularly interleaved text-image data. However, existing unified models exhibit limitations in image synthesis quality, autoregressive error accumulation, and image editing capability. In this work, we propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. This shared space serves as a bridge between the autoregressive and diffusion models, seamlessly integrating their complementary strengths in cross-modal modeling. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy that aligns training-inference dynamics by prefilling input sequences with learnable embeddings. After multi-stage, multi-task training on our constructed large-scale dataset of 26.3 million samples, Nexus-Gen achieves state-of-the-art performance on evaluation benchmarks spanning image understanding, generation, and editing tasks. All models, datasets, and source code are released at https://github.com/modelscope/Nexus-Gen to facilitate further advancements across the field.
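The prefilled-autoregression idea can be illustrated with a minimal sketch: rather than feeding the model's own continuous embedding predictions back as inputs (where errors compound position by position), every image-embedding slot is filled with the same learnable placeholder embedding, so the model sees identical inputs at training and inference time. The names (`prefill`, `build_sequence`, the dimensions) are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_IMG = 8, 4  # embedding dim, number of image-embedding positions (illustrative)

# A single learnable prefill embedding, shared across all image positions.
# In practice this would be a trainable parameter of the model.
prefill = rng.normal(size=(DIM,))

def build_sequence(text_embeds):
    """Prefill every image slot with the SAME learnable embedding.

    Because the image-slot inputs never depend on earlier predictions,
    the training-time and inference-time input sequences are identical,
    which is the source of the train/inference alignment described above.
    """
    img_slots = np.tile(prefill, (N_IMG, 1))
    return np.concatenate([text_embeds, img_slots], axis=0)

text = rng.normal(size=(5, DIM))   # prompt (text) embeddings
train_seq = build_sequence(text)   # paired with ground-truth image embeddings as targets
infer_seq = build_sequence(text)   # exactly the same construction at inference

# The image-slot inputs match between training and inference: no error
# accumulation from feeding back predicted embeddings.
assert np.allclose(train_seq[-N_IMG:], infer_seq[-N_IMG:])
print(train_seq.shape)  # (9, 8)
```

The model then predicts all image embeddings from these prefilled positions in one pass, and those predictions condition a diffusion decoder; the sketch only shows the input-construction step that removes the exposure-bias mismatch.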
Similar Papers
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
CV and Pattern Recognition
Lets computers understand and create images together.
MMGen: Unified Multi-modal Image Generation and Understanding in One Go
CV and Pattern Recognition
Creates pictures and understands them together.
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
CV and Pattern Recognition
Makes computers create and understand pictures and words.