Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion
By: Huaize Liu, Wenzhang Sun, Qiyuan Zhang and more
Potential Business Impact:
Makes videos smaller without losing quality.
Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations, significantly reducing redundancy. A conditional diffusion decoder then reconstructs videos by combining the hierarchical global and detailed motions, enabling high-fidelity reconstruction. Extensive experiments demonstrate that Hi-VAE achieves a compression factor of 1428$\times$, almost 30$\times$ higher than baseline methods (e.g., Cosmos-VAE at 48$\times$), validating the efficiency of our approach. Hi-VAE also maintains high reconstruction quality at this compression rate and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
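As a rough check on the figures quoted above, the following sketch works through the compression arithmetic. It assumes the compression factor is the ratio of input element count to total latent element count, which the abstract does not state explicitly. The function name `compression_factor` and the toy shapes are hypothetical and are not taken from the paper.

```python
from math import prod


def compression_factor(video_shape, latent_sizes):
    """Ratio of input elements to total latent elements (assumed definition)."""
    return prod(video_shape) / sum(latent_sizes)


# Compression factors quoted in the abstract.
HI_VAE_FACTOR = 1428
COSMOS_VAE_FACTOR = 48

# Relative gain over the baseline: the "almost 30x" claim.
relative_gain = HI_VAE_FACTOR / COSMOS_VAE_FACTOR
print(f"Hi-VAE vs. Cosmos-VAE: {relative_gain:.2f}x")  # 29.75x

# Toy illustration only: a (frames, height, width, channels) clip encoded
# into two latents (global and detailed motion) with made-up sizes.
toy_factor = compression_factor((8, 64, 64, 3), [64, 32])
print(f"Toy compression factor: {toy_factor:.0f}x")  # 1024x
```

Under this assumed definition, the sizes of both latents count toward the denominator, so the reported factor would reflect the combined cost of the global and detailed motion representations.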
Similar Papers
H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models
CV and Pattern Recognition
Lets phones create videos quickly and with good quality.
DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
CV and Pattern Recognition
Makes videos smaller by separating key parts.
Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context
CV and Pattern Recognition
Makes videos smaller without losing quality.