Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
By: Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, and more
Potential Business Impact:
Removes unwanted images from AI art generators.
Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations show that (i) per-step critics and (ii) noisy-conditioned rewards are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.
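To make the per-step critic and advantage estimation concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a timestep-aware critic scores noisy latents, its per-step values are converted into dense advantages, and those advantages drive a policy-gradient loss on the reverse diffusion kernel. All class, function, and parameter names here (TimestepAwareCritic, per_step_advantages, policy_gradient_loss, the latent/text dimensions) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed names and shapes, not the paper's official code).
import torch
import torch.nn as nn

class TimestepAwareCritic(nn.Module):
    """Predicts a per-step reward from a noisy latent z_t, the diffusion
    timestep t, and a (frozen) CLIP text embedding of the target concept.
    Trained separately as a reward predictor on noisy latents."""
    def __init__(self, latent_dim=4 * 64 * 64, text_dim=512, hidden=1024):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, 128), nn.SiLU(), nn.Linear(128, 128))
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 128, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_t, t, text_emb):
        t_feat = self.t_embed(t.float().view(-1, 1) / 1000.0)
        x = torch.cat([z_t.flatten(1), text_emb, t_feat], dim=-1)
        return self.net(x).squeeze(-1)  # scalar value estimate per sample


def per_step_advantages(critic, latents, timesteps, text_emb, final_reward):
    """Turns a sparse end-of-trajectory reward into dense per-step advantages:
    the difference between the critic's value at the next (less noisy) latent
    and the current one, with the terminal step anchored to the observed reward.

    latents: list of T tensors (B, 4, 64, 64); timesteps: list of T tensors (B,);
    text_emb: (B, text_dim); final_reward: (B,)."""
    with torch.no_grad():
        values = torch.stack(
            [critic(z, t, text_emb) for z, t in zip(latents, timesteps)]
        )  # (T, B)
    next_values = torch.cat([values[1:], final_reward.unsqueeze(0)], dim=0)
    advantages = next_values - values
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)


def policy_gradient_loss(log_probs, advantages):
    """REINFORCE-style loss over the reverse diffusion kernel: log_probs[t] is
    the log-likelihood of the sampled transition z_t -> z_{t-1} under the
    current model being updated for unlearning."""
    return -(log_probs * advantages.detach()).mean()
```

In this sketch the critic replaces the single end-of-trajectory reward with a value signal at every denoising step, which is the mechanism the abstract credits for lower-variance updates and better credit assignment; how the paper actually parameterizes the critic or normalizes advantages may differ.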
Similar Papers
Few-Shot Concept Unlearning with Low Rank Adaptation
Machine Learning (CS)
Removes unwanted images from AI art generators.
Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models
Machine Learning (CS)
Removes unwanted data from AI without retraining.
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
CV and Pattern Recognition
Makes AI pictures match words better.