Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10$\times$
By: Jiangning Zhang, Junwei Zhu, Teng Hu, and more
Potential Business Impact:
Makes ultra-sharp 4K videos over 10$\times$ faster to create.
Native 4K (2160$\times$3840) video generation remains a critical challenge: the computational cost of full attention explodes quadratically as spatiotemporal resolution increases, making it difficult for models to balance efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that significantly reduces compute requirements by optimizing the forward logic of full-attention pretrained models without altering their core architecture. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, transforms the attention pattern of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: it improves quality (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC) while accelerating native 4K video generation by more than 10$\times$. Project page: https://zhangzjn.github.io/projects/T3-Video
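The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of the general idea, not the paper's released code: attention is restricted to local spatiotemporal windows, and the same pretrained projection weights are reused at every window scale, so the retrofit needs no new attention parameters. The (B, T, H, W, C) latent layout, window sizes, and all function and parameter names are illustrative assumptions.

```python
# Sketch (assumed, not T3-Video's actual implementation): windowed attention
# that reuses a pretrained full-attention block's qkv/output projections.
import torch
import torch.nn.functional as F


def window_attention(x, qkv, proj, num_heads, window):
    """Attention within non-overlapping (wt, wh, ww) windows of a
    (B, T, H, W, C) video latent. `qkv` and `proj` are the pretrained
    projections, shared across all window scales."""
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    # Partition the latent into windows; cost becomes linear in the
    # number of windows instead of quadratic in the full token count.
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    q, k, v = qkv(x).chunk(3, dim=-1)

    def heads(t):  # (N, L, C) -> (N, num_heads, L, head_dim)
        return t.reshape(t.shape[0], t.shape[1], num_heads, -1).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    out = proj(out.transpose(1, 2).reshape(-1, wt * wh * ww, C))
    # Undo the window partition.
    out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


# Example: the same pretrained projections serve two window scales.
B, T, H, W, C, n_heads = 1, 8, 32, 32, 64, 4
qkv = torch.nn.Linear(C, 3 * C)   # stands in for pretrained weights
proj = torch.nn.Linear(C, C)
x = torch.randn(B, T, H, W, C)
fine = window_attention(x, qkv, proj, n_heads, window=(2, 8, 8))
coarse = window_attention(x, qkv, proj, n_heads, window=(4, 16, 16))
y = 0.5 * (fine + coarse)         # naive multi-scale fusion, for illustration
print(y.shape)                    # torch.Size([1, 8, 32, 32, 64])
```

Because the window scales share one set of pretrained weights, only the attention pattern changes; this is what lets the retrofit succeed with modest compute and data rather than retraining from scratch. How T3-Video actually fuses scales, blocks hierarchically, and preserves full attention along axes is specified in the paper, not here.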
Similar Papers
FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
CV and Pattern Recognition
Makes videos clearer and bigger without retraining.
ViT$^3$: Unlocking Test-Time Training in Vision
CV and Pattern Recognition
Makes computers understand pictures faster and better.
MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction
CV and Pattern Recognition
Fixes wobbly 3D pictures from moving cameras.