FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
By: Xuan Ju, Weicai Ye, Quande Liu, and more
Potential Business Impact:
Makes videos guided by many kinds of control input at once.
Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via a unified full-attention mechanism. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench, a benchmark for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
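The core idea, fusing condition tokens and video tokens into one sequence processed by full self-attention, can be illustrated with the minimal PyTorch sketch below. The module structure, dimensions, and condition types are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch of unified-sequence conditioning with full self-attention.
# All names, dimensions, and layer choices are illustrative assumptions;
# this is not the actual FullDiT architecture.
import torch
import torch.nn as nn

class UnifiedFullAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Full (non-causal) self-attention over the entire joint sequence,
        # so video tokens attend to every condition token and vice versa.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x

dim = 512
# Hypothetical token sequences: noisy video latents plus three condition types
# (e.g., camera trajectory, identity reference, depth), each already projected to `dim`.
video_tokens  = torch.randn(2, 256, dim)   # (batch, video_len, dim)
camera_tokens = torch.randn(2, 16, dim)
id_tokens     = torch.randn(2, 32, dim)
depth_tokens  = torch.randn(2, 64, dim)

# Fuse all conditions and video latents into one unified sequence representation.
joint = torch.cat([camera_tokens, id_tokens, depth_tokens, video_tokens], dim=1)

block = UnifiedFullAttentionBlock(dim)
out = block(joint)

# Only the video-token positions feed the denoising prediction.
video_out = out[:, -video_tokens.shape[1]:]
print(video_out.shape)  # torch.Size([2, 256, 512])
```

Because every condition shares the same attention pathway rather than a separate adapter branch, adding a new condition type only extends the input sequence instead of adding parallel modules, which is the source of the parameter savings and conflict avoidance claimed in the abstract.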
Similar Papers
FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
CV and Pattern Recognition
Makes video creation faster and easier.
Bidirectional Sparse Attention for Faster Video Diffusion Training
CV and Pattern Recognition
Makes video creation faster and cheaper.
Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
CV and Pattern Recognition
Makes videos with many scenes from text.