Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
By: Ziyuan Huang, DanDan Zheng, Cheng Zou, and more
Potential Business Impact:
Makes computers understand and create pictures better.
Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of MingTok, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that a unified continuous visual representation reconciles the competing requirements that understanding and generation place on the tokenizer, leading to state-of-the-art performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
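To make the three-stage design concrete, below is a minimal PyTorch sketch of a continuous tokenizer with low-level encoding, semantic expansion, and visual reconstruction stages. It is not the released implementation: the module structure, dimensions, and patch size (`low_dim`, `sem_dim`, `patch`) are illustrative assumptions, chosen only to show how a compact continuous code can be expanded into high-dimensional features for understanding while still supporting pixel reconstruction for generation.

```python
# Hypothetical sketch of a three-stage continuous tokenizer.
# All names and hyperparameters are assumptions, not values from the paper.
import torch
import torch.nn as nn

class ContinuousTokenizer(nn.Module):
    def __init__(self, patch=16, low_dim=32, sem_dim=1024):
        super().__init__()
        # Stage 1: low-level encoding -> compact continuous latents,
        # the generation-friendly low-dimensional code.
        self.encoder = nn.Conv2d(3, low_dim, kernel_size=patch, stride=patch)
        # Stage 2: semantic expansion -> high-dimensional discriminative
        # features of the kind understanding tasks favor.
        self.expand = nn.Sequential(
            nn.Linear(low_dim, sem_dim),
            nn.GELU(),
            nn.Linear(sem_dim, sem_dim),
        )
        # Stage 3: visual reconstruction -> decode pixels back from the
        # expanded semantic features.
        self.decoder = nn.ConvTranspose2d(sem_dim, 3, kernel_size=patch, stride=patch)

    def forward(self, img):                       # img: (B, 3, H, W)
        z = self.encoder(img)                     # (B, low_dim, H/p, W/p)
        b, c, h, w = z.shape
        tokens = z.flatten(2).transpose(1, 2)     # (B, N, low_dim) continuous tokens
        feats = self.expand(tokens)               # (B, N, sem_dim) for understanding
        grid = feats.transpose(1, 2).reshape(b, -1, h, w)
        recon = self.decoder(grid)                # (B, 3, H, W) reconstruction
        return tokens, feats, recon

x = torch.randn(1, 3, 256, 256)
tokens, feats, recon = ContinuousTokenizer()(x)
print(tokens.shape, feats.shape, recon.shape)
```

The key design point the sketch illustrates is that both task regimes read from a single shared latent: the downstream model can predict the compact continuous tokens autoregressively, while the expansion stage recovers the richer features that understanding requires, so no task-specific tokenizer is needed.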
Similar Papers
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
CV and Pattern Recognition
Lets computers understand and create pictures.
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Computation and Language
Lets you edit spoken words with just your voice.
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
CV and Pattern Recognition
Makes computers create and change pictures from words.