NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
By: Huichao Zhang, Liao Qu, Yiheng Liu, and more
Potential Business Impact:
Generates images and videos from text far faster than comparable models.
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the modalities - text is strictly sequential, while images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
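The speedup claim follows from how next-scale prediction changes the number of autoregressive steps: instead of emitting one token per step in raster order, the model emits an entire token map per step, conditioned on all coarser scales. The sketch below illustrates only this step-count arithmetic; the scale schedule, vocabulary size, and `predict_scale` stand-in are hypothetical, not NextFlow's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scale schedule: token-map side lengths, coarse to fine.
# (The abstract does not specify NextFlow's real schedule.)
SCALES = [1, 2, 4, 8, 16]
VOCAB = 4096  # assumed discrete-token vocabulary size

def predict_scale(coarser_maps, side, vocab=VOCAB):
    """Stand-in for the transformer: emit one full side x side token map
    in a single step, conditioned on all coarser maps. Every token of the
    scale is produced in parallel, unlike raster-scan next-token AR."""
    return rng.integers(0, vocab, size=(side, side))

def generate():
    maps = []
    for side in SCALES:
        maps.append(predict_scale(maps, side))
    return maps

token_maps = generate()

# Raster-scan AR needs one step per token (sum of side**2);
# next-scale prediction needs only one step per scale.
raster_steps = sum(s * s for s in SCALES)  # 341
scale_steps = len(SCALES)                  # 5
print(raster_steps, scale_steps)
```

Under this toy schedule, raster-scan generation would take 341 sequential steps versus 5 for next-scale prediction, which is the kind of gap that makes the reported latency plausible.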
Similar Papers
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
CV and Pattern Recognition
Creates amazing pictures from words.
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
CV and Pattern Recognition
Makes computers understand and create pictures better.