VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Published: November 28, 2025 | arXiv ID: 2511.23386v1

By: Sinan Du, Jiahao Guo, Bo Li and more

Potential Business Impact:

Lets computers understand and create images.

Business Areas:

Image Recognition, Data and Analytics, Software

Unifying the representations for multimodal understanding, generation and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second stage jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, which preserves the ability of multimodal understanding, while yielding discrete tokens that are compatible with generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several benchmarks of visual understanding, generation and reconstruction, and shows promising scaling behavior in the autoregressive paradigm owing to its discrete tokens.
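To make the terms "VQ codebook" and "utilization ratio" concrete, here is a minimal sketch of nearest-neighbor vector quantization in plain Python. It is an illustration only, not the authors' implementation. The names `quantize` and `utilization` are hypothetical, and the toy sizes (dimension 8, 16 codes, random Gaussian data) are assumptions standing in for the 1536-dimensional codebook and real encoder features mentioned in the abstract.

```python
import random


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    indices = []
    for f in features:
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(f, codebook[k])),
        )
        indices.append(best)
    return indices


def utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


# Toy setup (assumed sizes, far smaller than the paper's setting).
rng = random.Random(0)
dim, num_codes, num_features = 8, 16, 256
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_features)]

tokens = quantize(features, codebook)          # discrete token ids
ratio = utilization(tokens, len(codebook))     # share of codes in use
print(f"utilization ratio: {ratio:.2%}")
```

In this sketch the continuous vectors in `features` play the role of the semantic features used for understanding, while `tokens` plays the role of the discrete ids used for generation. A utilization ratio of 100% would mean every codebook entry is selected by at least one feature.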

VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling

CV and Pattern Recognition

Makes AI create better, more realistic pictures.

10 Nov 2025 2

89%

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

CV and Pattern Recognition

Makes AI better at seeing and drawing pictures.

1 Apr 2025 0

89%

OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better

CV and Pattern Recognition

Makes videos understandable for smart computer programs.

13 Aug 2025 1

Page Count

19 pages
