GDRO: Group-level Reward Post-training Suitable for Diffusion Models
By: Yiyang Wang, Xi Chen, Xiaogang Xu, and more
Potential Business Impact:
Makes AI art follow your exact instructions better.
Recent work transfers online reinforcement learning (RL) techniques from LLMs to text-to-image rectified flow diffusion models for reward alignment. Group-level rewards successfully align these models with a target reward, but the approach faces low efficiency, dependence on stochastic samplers, and reward hacking. The root cause is that rectified flow models differ fundamentally from LLMs: 1) for efficiency, online image sampling is far more expensive and dominates training time; 2) for stochasticity, rectified flow is deterministic once the initial noise is fixed. Motivated by these problems and by the effectiveness of group-level rewards in LLMs, we design Group-level Direct Reward Optimization (GDRO), a new post-training paradigm for group-level reward alignment tailored to the characteristics of rectified flow models. Through theoretical analysis, we show that GDRO supports fully offline training, removing the large time cost of online image rollouts, and that it is sampler-independent, eliminating the ODE-to-SDE approximation otherwise needed to introduce stochasticity. We also empirically study the reward hacking trap that can mislead evaluation, and account for it with a corrected score that considers both the original evaluation reward and the trend of reward hacking. Extensive experiments on the OCR and GenEval tasks demonstrate that GDRO effectively and efficiently improves the diffusion model's reward score through group-wise offline optimization, while exhibiting strong stability and robustness in mitigating reward hacking.
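The abstract does not spell out the GDRO objective, but the "group-level reward" idea it borrows from LLM-style RL is commonly implemented as a within-group normalized advantage: several images are generated (here, offline) for the same prompt, scored by a reward model, and each sample's advantage is its reward relative to its group's statistics. The sketch below illustrates only that generic group-normalization step, assuming a GRPO-style mean/std baseline; the function name and tensor shapes are illustrative, not the paper's actual implementation.

```python
# Hedged sketch: group-normalized advantages over offline samples.
# Assumes a GRPO-style baseline (mean/std within each prompt group);
# the exact GDRO objective is not given in this summary.
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) reward scores for images sampled
    offline for the same prompt. Returns per-sample advantages normalized
    within each prompt group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 offline images each.
rewards = torch.tensor([[0.2, 0.5, 0.9, 0.4],
                        [0.7, 0.1, 0.3, 0.8]])
adv = group_normalized_advantages(rewards)
print(adv)  # positive for above-group-average images, negative otherwise
```

Because the advantages are computed from pre-generated (offline) samples, no online rollout is needed during optimization, which is the efficiency gain the abstract emphasizes.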
Similar Papers
Flow-GRPO: Training Flow Matching Models via Online RL
CV and Pattern Recognition
Makes AI pictures match words perfectly.
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Computation and Language
Makes AI learn many things at once better.
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
Machine Learning (CS)
Trains AI to make better pictures much faster.