Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
By: Bohan Wang, Zhongqi Yue, Fengda Zhang, and others
Potential Business Impact:
Teaches computers to understand and create pictures like words.
We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: the Self-consistency Tokenizer (Selftok). At its design core, we build an autoregressive (AR) prior, mirroring the causal structure of language, into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in two key ways:

- Selftok offers an elegant and minimalist approach to unifying diffusion and AR for vision-language models (VLMs): by representing images with Selftok tokens, we can train a VLM with a purely discrete autoregressive architecture, like that of LLMs, without requiring additional modules or training objectives.
- We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Selftok therefore supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs (a minimal policy-gradient sketch follows below).

Beyond the AR property, Selftok is also a state-of-the-art tokenizer that achieves a favorable trade-off between reconstruction quality and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy-gradient RL method operating on the visual tokens significantly boosts visual generation benchmarks, surpassing all existing models by a large margin. We therefore believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. Combined with the well-established strengths of RL in LLMs, this brings us one step closer to a truly multimodal LLM. Project Page: https://selftok-team.github.io/report/.
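To make the RL claim concrete: in the AR factorization, the state after t tokens is the prefix of tokens emitted so far, the action is the next token, and the reward scores the finished image, so per-token log-probabilities factorize exactly as in LLM policy gradients. Below is a minimal, hypothetical REINFORCE-style sketch of such an update over discrete visual tokens; it is not the authors' implementation. The autoregressive `model` (assumed to return next-token logits over the Selftok vocabulary, shape [batch, length, vocab]) and the image-level `reward_fn` (assumed to return one scalar reward per sample) are placeholders.

```python
# Minimal sketch (not the paper's implementation): REINFORCE-style policy
# gradient over discrete visual tokens, assuming a hypothetical AR `model`
# and a hypothetical `reward_fn` that scores the decoded image per sample.
import torch

def policy_gradient_step(model, reward_fn, prompt_ids, num_tokens, optimizer):
    tokens = prompt_ids                      # conditioning tokens, e.g. text
    log_probs = []
    for _ in range(num_tokens):              # sample visual tokens one by one
        logits = model(tokens)[:, -1, :]     # next-token logits over vocab
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                  # discrete visual token, [batch]
        log_probs.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.unsqueeze(1)], dim=1)
    # Reward on the completed token sequence; no gradient flows through it.
    reward = reward_fn(tokens.detach())      # [batch] scalar rewards
    # REINFORCE: maximize E[reward] => minimize -reward * sum_t log pi(x_t)
    # (a learned baseline would reduce variance; omitted for brevity)
    loss = -(reward * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```

Because the tokens are discrete and causally ordered, this update is structurally identical to policy-gradient RL for LLMs; that correspondence is what the paper attributes to the AR prior satisfying the Bellman equation.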
Similar Papers
REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
CV and Pattern Recognition
Makes AI create better pictures from words.
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
CV and Pattern Recognition
Makes AI better at understanding and creating pictures.
Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
CV and Pattern Recognition
Creates detailed pictures from simple ideas.