One-Minute Video Generation with Test-Time Training
By: Karan Dalal, Daniel Koceja, Gashon Hussein, and more
Potential Business Impact:
Makes computers create longer, better cartoon stories.
Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states can themselves be neural networks and are therefore more expressive. Adding TTT layers to a pre-trained Transformer enables it to generate one-minute videos from text storyboards. As a proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, the results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have experimented only with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code, and annotations are available at: https://test-time-training.github.io/video-dit
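To make "hidden states that are themselves neural networks" concrete, here is a minimal sketch of a TTT-style layer whose inner model is a single linear map, updated by one gradient step per token on a self-supervised reconstruction loss. The projection names (Wq, Wk, Wv), the learning rate, and the specific loss are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Minimal sketch of a Test-Time Training (TTT) layer with a linear inner model.
# The "hidden state" is the weight matrix W of the inner model, which is trained
# online as tokens arrive; the layer output is the inner model's prediction.
# Projections Wq/Wk/Wv, the loss, and the learning rate are illustrative assumptions.
import torch

def ttt_linear_forward(x, Wq, Wk, Wv, lr=0.1):
    """x: (seq_len, dim) token features. Returns (seq_len, dim) layer outputs."""
    dim = x.shape[1]
    W = torch.zeros(dim, dim)              # inner model's weights = the hidden state
    outputs = []
    for t in range(x.shape[0]):
        k = Wk @ x[t]                      # input view for the inner self-supervised task
        v = Wv @ x[t]                      # reconstruction target
        q = Wq @ x[t]                      # query view used to produce the layer output
        err = W @ k - v                    # reconstruction error of the inner model
        W = W - lr * torch.outer(err, k)   # one gradient step on ||W k - v||^2 (factor 2 folded into lr)
        outputs.append(W @ q)              # read out with the updated inner model
    return torch.stack(outputs)

# Tiny usage example with random projections.
dim, seq_len = 8, 16
x = torch.randn(seq_len, dim)
Wq, Wk, Wv = (0.1 * torch.randn(dim, dim) for _ in range(3))
print(ttt_linear_forward(x, Wq, Wk, Wv).shape)  # torch.Size([16, 8])
```

The point of the design, as the abstract describes it, is that the per-sequence state is a trainable model rather than a fixed-size vector, so its expressiveness can grow with the inner model (e.g., an MLP instead of the linear map above) while still processing long contexts sequentially.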
Similar Papers
Test-Time Training Provably Improves Transformers as In-context Learners
Machine Learning (CS)
Teaches computers to learn from fewer examples.
Video-T1: Test-Time Scaling for Video Generation
CV and Pattern Recognition
Makes videos better by thinking more after you ask.
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
CV and Pattern Recognition
Makes AI understand pictures and words faster.