Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
By: Xiyuan Wang, Muhan Zhang
Potential Business Impact:
Makes AI create pictures faster and better.
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet 256×256 conditional generation task: FID = 13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs, without using classifier-free guidance.
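To make the "single network, joint objectives" idea concrete, here is a minimal sketch of end-to-end training of one network that encodes, decodes, and denoises in latent space. All names (`UnifiedNet`, the toy linear layers, the simple noising schedule) are hypothetical illustrations, not the authors' architecture; the stop-gradient on the latent target follows the self-distillation analogy described in the abstract as one plausible way to keep the diffusion loss from collapsing the latent space, not the paper's exact objective.

```python
import torch
import torch.nn as nn

class UnifiedNet(nn.Module):
    """Toy single network with encoder, decoder, and latent denoiser heads."""
    def __init__(self, dim_x: int = 16, dim_z: int = 4):
        super().__init__()
        self.encoder = nn.Linear(dim_x, dim_z)
        self.decoder = nn.Linear(dim_z, dim_x)
        # Denoiser is conditioned on the (scalar) noise level t.
        self.denoiser = nn.Linear(dim_z + 1, dim_z)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                 # latent representation
        x_rec = self.decoder(z)             # reconstruction branch
        # Stop-gradient on the clean latent used as the diffusion target,
        # so the diffusion loss cannot drive the encoder toward collapse
        # (self-distillation-style asymmetry; an assumption of this sketch).
        z_tgt = z.detach()
        eps = torch.randn_like(z_tgt)
        z_noisy = (1.0 - t) * z_tgt + t * eps   # simple linear noising
        eps_hat = self.denoiser(torch.cat([z_noisy, t], dim=-1))
        # Joint objective: reconstruction + noise prediction.
        return (nn.functional.mse_loss(x_rec, x)
                + nn.functional.mse_loss(eps_hat, eps))

# One joint training step over both objectives.
net = UnifiedNet()
x = torch.randn(8, 16)
t = torch.rand(8, 1)
loss = net(x, t)
loss.backward()
```

Without the `detach()`, the encoder could minimize the denoising term by shrinking all latents toward a trivial point, which is one intuition for the "latent collapse" failure mode the abstract describes.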
Similar Papers
From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model
CV and Pattern Recognition
Makes AI create detailed pictures much faster.
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Machine Learning (CS)
Makes AI write and understand faster.
DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning
CV and Pattern Recognition
Makes AI better at creating and understanding pictures.