Denoising Vision Transformer Autoencoder with Spectral Self-Regularization
By: Xunzhi Xiang, Xingye Tian, Guiyu Zhang and more
Potential Business Impact:
Makes AI create better pictures faster.
Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.
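To make the notion of "redundant high-frequency components" in a latent more concrete, the following is a minimal sketch of one way to measure how much of a latent's spectral energy sits at high spatial frequencies. It is an illustration only, not the authors' regularization or alignment method, which the abstract does not specify. The function name `high_freq_energy_ratio`, the `cutoff` parameter, the radial frequency mask, and the synthetic latents are all assumptions introduced here.

```python
import numpy as np


def high_freq_energy_ratio(latent, cutoff=0.5):
    """Return the fraction of spectral energy above a radial frequency cutoff.

    latent: array of shape (C, H, W), e.g. one encoded image (assumed layout).
    cutoff: fraction of the Nyquist frequency (0.5 cycles/sample) above which
            a frequency counts as "high" (hypothetical parameter).
    """
    _, h, w = latent.shape
    # 2D FFT over the spatial axes of every channel.
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2

    # Radial frequency of each FFT bin, in cycles per sample.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    high = radius > cutoff * 0.5  # boolean mask of shape (H, W)
    total = power.sum()
    return float(power[:, high].sum() / total) if total > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    _, xx = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
    # Synthetic stand-ins for latents: a smooth pattern, and the same
    # pattern with white noise added.
    smooth = np.repeat(np.sin(2 * np.pi * xx / 32)[None], 4, axis=0)
    noisy = smooth + 0.5 * rng.standard_normal(smooth.shape)
    print("smooth:", high_freq_energy_ratio(smooth))
    print("noisy: ", high_freq_energy_ratio(noisy))
```

On these synthetic inputs the noisy latent yields a clearly larger ratio than the smooth one. A diagnostic of this kind could be used to compare latents from different autoencoders, though the threshold and the definition of "high frequency" are choices of this sketch rather than values taken from the paper.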
Similar Papers
Wavelet-based Variational Autoencoders for High-Resolution Image Generation
CV and Pattern Recognition
Makes computer pictures sharper and more detailed.
Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
CV and Pattern Recognition
Makes AI image makers create sharper, more detailed pictures.
Hyperspectral Variational Autoencoders for Joint Data Compression and Component Extraction
Machine Learning (CS)
Shrinks huge satellite pictures to share them faster.