ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL
By: Yu Zhang, Yunqi Li, Yifan Yang, and more
Potential Business Impact:
Makes AI draw pictures by thinking first.
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
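The "group relative" part of GRPO can be sketched briefly: for each prompt, the policy samples a group of candidate images, a reward model scores each one, and advantages are computed by normalizing rewards within the group rather than with a learned value network. A minimal illustration of that normalization step, assuming scalar rewards from some scoring model (the `rewards` values below are made up; the paper's actual reward comes from a pretrained vision-language model):

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, GRPO-style:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Candidates scored above the group mean get positive advantage,
    those below get negative advantage."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

# Hypothetical reward-model scores for 4 images sampled from one prompt:
advs = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
# Advantages sum to ~0; the best sample is pushed up, the worst pushed down.
```

These advantages then weight a clipped policy-gradient update on the image-token log-probabilities, so no separate critic is needed.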
Similar Papers
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
CV and Pattern Recognition
Makes AI create better pictures from words.
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning
CV and Pattern Recognition
Teaches computers to understand pictures better.
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation
Artificial Intelligence
Helps computers judge AI art quality better.