REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
By: Yitian Zhang, Long Mai, Aniruddha Mahapatra, and more
Potential Business Impact:
Compresses videos into much smaller latent codes, making video generation faster and cheaper.
We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. To this end, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results for video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
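The sketch below illustrates the encoder-generator idea described in the abstract: a convolutional encoder compresses a clip 32x in time into compact latent tokens, and a small DiT-style decoder refines noisy video tokens while attending to those latents. All module names, dimensions, and the choice of cross-attention as the latent conditioning mechanism are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of an encoder-generator video embedder (assumed design,
# not REGEN's actual implementation): the encoder yields a compact latent,
# and a DiT-like decoder is conditioned on it via cross-attention.
import torch
import torch.nn as nn

class CompactVideoEncoder(nn.Module):
    """Compresses (B, C, T, H, W) video into compact latent tokens (B, N, D)."""
    def __init__(self, in_ch=3, latent_dim=256):
        super().__init__()
        layers = []
        for i in range(5):
            # Temporal stride 2 in every layer -> 2^5 = 32x temporal compression;
            # spatial stride 2 only in the first three layers -> 8x spatial compression.
            spatial_stride = 2 if i < 3 else 1
            layers.append(nn.Conv3d(in_ch if i == 0 else latent_dim, latent_dim,
                                    kernel_size=3,
                                    stride=(2, spatial_stride, spatial_stride),
                                    padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, video):
        z = self.net(video)                      # (B, D, T/32, H/8, W/8)
        return z.flatten(2).transpose(1, 2)      # (B, N_latent, D) latent tokens

class LatentConditionedBlock(nn.Module):
    """DiT-style block: self-attention over noisy tokens, cross-attention to the latent."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, latent):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), latent, latent)[0]  # latent conditioning
        return x + self.mlp(self.norm3(x))

class GenerativeDecoder(nn.Module):
    """Predicts the denoising target for noisy video tokens, conditioned on the latent."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([LatentConditionedBlock(dim) for _ in range(depth)])
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, latent):
        x = noisy_tokens
        for blk in self.blocks:
            x = blk(x, latent)
        return self.out(x)

# Toy forward pass: 32 frames collapse to a single latent time step (32x compression).
video = torch.randn(1, 3, 32, 64, 64)
latent = CompactVideoEncoder()(video)            # (1, 64, 256)
noisy = torch.randn(1, 128, 256)                 # placeholder noisy video tokens
denoised = GenerativeDecoder()(noisy, latent)    # (1, 128, 256)
print(latent.shape, denoised.shape)
```

Because the decoder only needs to produce a plausible reconstruction rather than an exact one, the latent can be far more compact than in a conventional autoencoder; the diffusion decoder hallucinates the fine details the latent no longer carries.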
Similar Papers
Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation
CV and Pattern Recognition
Makes AI create pictures and videos much faster.
Fast Autoregressive Video Generation with Diagonal Decoding
CV and Pattern Recognition
Generates videos up to 10x faster.
Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
CV and Pattern Recognition
Makes videos look better with simpler training.