MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning
By: Jinhua Zhang, Wei Long, Minghao Han and more
Potential Business Impact:
Makes computers draw pictures much faster.
Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods have redefined the process as learning the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, which introduces both scale and spatial redundancy. To better model the distribution by mitigating this redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes as input only the features of the adjacent preceding scale for next-scale prediction, enabling a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, to reduce modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for a KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small models trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
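To make the O(N^2) to O(Nk) claim concrete, the following is a minimal NumPy sketch of neighborhood-restricted attention. It is not the authors' implementation. It assumes a simplified 1D setting in which queries and keys have the same length N and each query attends to a window of at most k keys centered on its own position, whereas the paper applies the restriction to 2D token maps on adjacent scales. All function names (`local_attention`, `masked_full_attention`, `count_scores`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_attention(q, k_mat, v, window):
    """Windowed attention (illustrative assumption, 1D only).

    Query i attends only to keys whose index is within window // 2 of i,
    so each query computes at most `window` scores instead of N.
    """
    n, d = q.shape
    half = window // 2
    out = np.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        scores = q[i] @ k_mat[lo:hi].T / np.sqrt(d)
        out[i] = softmax(scores) @ v[lo:hi]
    return out


def masked_full_attention(q, k_mat, v, window):
    """Reference: full N x N attention with a band mask (same result, O(N^2) work)."""
    n, d = q.shape
    scores = q @ k_mat.T / np.sqrt(d)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    scores = np.where(band, scores, -np.inf)
    return softmax(scores) @ v


def count_scores(n, window):
    """Number of query-key scores the windowed version evaluates."""
    half = window // 2
    return sum(min(n, i + half + 1) - max(0, i - half) for i in range(n))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, window = 16, 8, 3
    q, k_mat, v = (rng.normal(size=(n, d)) for _ in range(3))
    same = np.allclose(local_attention(q, k_mat, v, window),
                       masked_full_attention(q, k_mat, v, window))
    print("windowed matches masked full attention:", same)
    print("scores computed:", count_scores(n, window), "vs full:", n * n)
```

The windowed version never materializes the N x N score matrix, which is where the linear-in-N cost comes from in this sketch. The masked reference exists only to check that both formulations produce the same output.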