Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
By: Shizhan Liu, Xinran Deng, Zhuoyi Yang, and more
Potential Business Impact:
Makes AI create videos from words much faster.
Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
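To make the two spectral properties concrete, here is a minimal NumPy sketch of how one might measure a channel-wise eigenspectrum and penalize high-frequency latent structure. This is an illustration under assumed shapes and names (`channel_eigenspectrum`, `local_correlation_penalty` are hypothetical helpers), not the paper's actual regularizer implementation:

```python
import numpy as np

def channel_eigenspectrum(latents):
    # latents: (N, C) array, where N flattens batch/time/space and C is
    # the latent channel dimension. Returns eigenvalues of the channel
    # covariance sorted descending; a "few dominant modes" means the
    # leading eigenvalues carry most of the variance.
    centered = latents - latents.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(centered)
    return np.linalg.eigvalsh(cov)[::-1]

def local_correlation_penalty(latent, shift=1):
    # latent: (C, H, W) array for a single frame's latent.
    # A toy surrogate for encouraging a low-frequency-biased spectrum:
    # penalize squared differences between spatially adjacent latent
    # values, which is small for smooth (low-frequency) latents and
    # large for noisy (high-frequency) ones.
    dh = latent[:, shift:, :] - latent[:, :-shift, :]
    dw = latent[:, :, shift:] - latent[:, :, :-shift]
    return float((dh ** 2).mean() + (dw ** 2).mean())
```

A smooth latent yields a much smaller penalty than an i.i.d.-noise latent of the same shape, and a latent whose variance concentrates in one channel shows a sharply decaying eigenspectrum.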
Similar Papers
Denoising Vision Transformer Autoencoder with Spectral Self-Regularization
CV and Pattern Recognition
Makes AI create better pictures faster.
Distribution Matching Variational AutoEncoder
CV and Pattern Recognition
Makes AI draw better pictures by changing how it learns.
Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression
CV and Pattern Recognition
Makes videos smaller for faster streaming.