MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Published: May 19, 2025 | arXiv ID: 2505.12742v1

By: Jinhua Zhang, Wei Long, Minghao Han and more

Potential Business Impact:

Makes computers draw pictures much faster.

Business Areas:

Image Recognition Data and Analytics, Software

Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods have redefined the process as learning the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes as input only the features of the adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, to reduce modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for a KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small models trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
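To make the O(N^2) versus O(Nk) claim concrete, here is a minimal pure-Python sketch of neighborhood-restricted attention. It is not the authors' implementation. It assumes a single h x w token grid in which queries and keys share the same resolution, and a square window of k = (2 * radius + 1)^2 positions clipped at the borders. The paper instead attends across adjacent scales, which this simplification ignores. The names `neighborhood`, `local_attention`, `pair_counts`, and `radius` are hypothetical and chosen only for illustration.

```python
import math


def neighborhood(h, w, row, col, radius):
    """Flat row-major indices of grid cells within `radius` of (row, col).

    The window is clipped at the grid borders, so edge tokens see fewer keys.
    """
    return [
        r * w + c
        for r in range(max(0, row - radius), min(h, row + radius + 1))
        for c in range(max(0, col - radius), min(w, col + radius + 1))
    ]


def local_attention(q, k, v, h, w, radius):
    """Softmax attention in which each query sees only its spatial window.

    q, k, v are lists of N = h * w vectors (lists of floats), row-major.
    Assumption: one shared grid for queries and keys (not the paper's
    cross-scale setup).
    """
    dim = len(q[0])
    out = []
    for i in range(h * w):
        idx = neighborhood(h, w, i // w, i % w, radius)
        # Scaled dot-product scores against the neighboring keys only.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim) for j in idx
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(wt * v[j][d] for wt, j in zip(weights, idx)) / total
            for d in range(len(v[0]))
        ])
    return out


def pair_counts(h, w, radius):
    """Return (full-attention pairs, windowed-attention pairs) for the grid."""
    n = h * w
    local = sum(len(neighborhood(h, w, i // w, i % w, radius)) for i in range(n))
    return n * n, local
```

Under these assumptions, a 16 x 16 grid (N = 256) with `radius=1` (k = 9) needs 2,116 query-key pairs instead of the 65,536 required by full attention, which stays below the N * k = 2,304 bound because border windows are clipped. The work grows linearly in N for a fixed window, which is the scaling behavior the abstract describes.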

Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

CV and Pattern Recognition

Makes AI draw pictures faster and use less power.

28 Nov 2025 0

90%

Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis

CV and Pattern Recognition

Makes pictures draw faster without losing detail.

28 Jul 2025 1

90%

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

Machine Learning (CS)

Makes AI image creation use much less computer memory.

26 May 2025 4


Repos / Data Links

github.com

Page Count

16 pages
