Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
By: Hong Zhang, Zhongjie Duan, Xingjun Wang, et al.
Potential Business Impact:
Makes computers understand, create, and change pictures.
Unified multimodal generative models aim to integrate image understanding and generation abilities, offering significant advantages in harnessing multimodal corpora, particularly interleaved text-image data. However, existing unified models exhibit limitations in image synthesis quality, autoregressive error accumulation, and image editing capability. In this work, we propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. This shared space serves as a bridge between the autoregressive and diffusion models, seamlessly integrating their complementary strengths in cross-modal modeling. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy that aligns training-inference dynamics by prefilling input sequences with learnable embeddings. After multi-stage, multi-task training on our constructed large-scale dataset of 26.3 million samples, Nexus-Gen achieves state-of-the-art performance on evaluation benchmarks spanning image understanding, generation, and editing tasks. All models, datasets, and source code are released at https://github.com/modelscope/Nexus-Gen to facilitate further advancements across the field.
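The prefilled-autoregression idea can be illustrated with a minimal sketch: rather than feeding the model's own continuous embedding predictions back as inputs (where errors compound position by position), every image-embedding slot is filled with the same learnable placeholder embedding, so the model sees identical inputs at training and inference time. The names (`prefill`, `build_sequence`, the dimensions) are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_IMG = 8, 4  # embedding dim, number of image-embedding positions (illustrative)

# A single learnable prefill embedding, shared across all image positions.
# In practice this would be a trainable parameter of the model.
prefill = rng.normal(size=(DIM,))

def build_sequence(text_embeds):
    """Prefill every image slot with the SAME learnable embedding.

    Because the image-slot inputs never depend on earlier predictions,
    the training-time and inference-time input sequences are identical,
    which is the source of the train/inference alignment described above.
    """
    img_slots = np.tile(prefill, (N_IMG, 1))
    return np.concatenate([text_embeds, img_slots], axis=0)

text = rng.normal(size=(5, DIM))   # prompt (text) embeddings
train_seq = build_sequence(text)   # paired with ground-truth image embeddings as targets
infer_seq = build_sequence(text)   # exactly the same construction at inference

# The image-slot inputs match between training and inference: no error
# accumulation from feeding back predicted embeddings.
assert np.allclose(train_seq[-N_IMG:], infer_seq[-N_IMG:])
print(train_seq.shape)  # (9, 8)
```

The model then predicts all image embeddings from these prefilled positions in one pass, and those predictions condition a diffusion decoder; the sketch only shows the input-construction step that removes the exposure-bias mismatch.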
Similar Papers
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
CV and Pattern Recognition
Lets computers understand and create images together.
MMGen: Unified Multi-modal Image Generation and Understanding in One Go
CV and Pattern Recognition
Creates pictures and understands them together.
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
CV and Pattern Recognition
Makes computers create and understand pictures and words.