VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
By: Shikun Sun, Liao Qu, Huichao Zhang, and more
Potential Business Impact:
Teaches AI to create better pictures faster.
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
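To make the three components concrete, here is a minimal sketch of how a group-relative objective could combine a stabilizing intermediate reward with dynamic time-step reweighting across VAR generation steps. All names (`group_relative_advantages`, `timestep_weights`, the mixing weight `alpha`, the geometric weighting) are illustrative assumptions, not the paper's actual implementation, and the mask propagation component is omitted.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style advantage: normalize each sample's reward against its group.
    mean, std = rewards.mean(), rewards.std() + 1e-8
    return (rewards - mean) / std

def timestep_weights(num_steps, gamma=0.9):
    # Assumed dynamic reweighting: later (finer) VAR scales receive larger
    # weight so credit concentrates where detail is decided. The paper's
    # actual scheme may differ.
    w = np.array([gamma ** (num_steps - 1 - t) for t in range(num_steps)])
    return w / w.sum()

def grpo_objective(log_ratios, final_rewards, intermediate_rewards, alpha=0.1):
    """log_ratios: (G, T) per-sample, per-step policy log-prob ratios.
    final_rewards: (G,) terminal reward per sample in the group.
    intermediate_rewards: (G, T) stabilizing rewards on partial generations.
    alpha is an assumed mixing weight for the intermediate signal."""
    G, T = log_ratios.shape
    # Blend terminal and intermediate rewards, then compute per-step
    # group-relative advantages.
    blended = final_rewards[:, None] + alpha * intermediate_rewards
    adv = np.apply_along_axis(group_relative_advantages, 0, blended)
    w = timestep_weights(T)
    # Time-step-weighted policy-gradient surrogate, averaged over the group.
    return float((w[None, :] * log_ratios * adv).mean())

# Toy usage with random data in place of real rollouts.
rng = np.random.default_rng(0)
G, T = 8, 10
print(grpo_objective(rng.normal(size=(G, T)),
                     rng.normal(size=G),
                     rng.normal(size=(G, T))))
```

A full implementation would also apply GRPO's clipped-ratio surrogate and the paper's mask propagation to restrict gradients spatially and temporally; this sketch only shows the reward blending and time-step reweighting structure.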
Similar Papers
Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization
CV and Pattern Recognition
Makes AI draw pictures better and in new styles.
Diversity Has Always Been There in Your Visual Autoregressive Models
CV and Pattern Recognition
Makes AI create more varied and interesting pictures.
AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
CV and Pattern Recognition
Makes AI create better, more realistic pictures.