Denoising Vision Transformer Autoencoder with Spectral Self-Regularization
By: Xunzhi Xiang, Xingye Tian, Guiyu Zhang and more
Potential Business Impact:
Makes AI create better pictures faster.
Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.
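The abstract's central claim concerns high-frequency content in latent maps. As a rough illustration only, and not the authors' method, the sketch below shows one way such content could be measured and removed with a 2D FFT. The function names, the `(C, H, W)` latent layout, and the radial `cutoff` of 0.25 cycles per sample are all assumptions made for this example; the paper's actual regularizer and alignment strategy are not specified in the abstract.

```python
import numpy as np


def _radial_frequency(h, w):
    """Radial spatial frequency (cycles per sample) of each 2D FFT bin."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fy ** 2 + fx ** 2)


def high_freq_energy_ratio(latent, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` (hypothetical diagnostic).

    latent: real array of shape (C, H, W), an assumed latent layout.
    """
    _, h, w = latent.shape
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2
    high = _radial_frequency(h, w) > cutoff
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[:, high].sum() / total)


def low_pass(latent, cutoff=0.25):
    """Zero out all frequency bins above `cutoff` (an ideal low-pass filter)."""
    _, h, w = latent.shape
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    keep = _radial_frequency(h, w) <= cutoff
    return np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real


# Demo on synthetic data: white noise stands in for a noisy latent.
rng = np.random.default_rng(0)
noisy_latent = rng.standard_normal((4, 16, 16))
print("before:", high_freq_energy_ratio(noisy_latent))
print("after: ", high_freq_energy_ratio(low_pass(noisy_latent)))
```

A diagnostic like this only quantifies how much energy sits above a chosen frequency. Whether that energy is redundant, and how to suppress it without hurting reconstruction, is the question the paper addresses.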
Similar Papers
Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
CV and Pattern Recognition
Makes AI create videos from words much faster.
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
CV and Pattern Recognition
Makes AI create better, more detailed pictures.
Variational decomposition autoencoding improves disentanglement of latent representations
Machine Learning (CS)
Finds hidden patterns in sounds and body signals.