MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

Published: October 21, 2025 | arXiv ID: 2510.18692v1

By: Weinan Jia, Yuning Lu, Mengqi Huang and more

Potential Business Impact:

Makes computers create long videos much faster.

Business Areas:

MMO Games Gaming

Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Because attention is highly redundant, its outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on coarse blockwise estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, the authors develop an efficient long video generation model that produces minute-level, multi-shot, 480p videos at 24 fps end to end, with a context length of approximately 580k tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of the approach.
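The routing idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the router, so the linear router with argmax assignment, the helper names (`route_tokens`, `grouped_attention`, `attention_pairs`), and the toy sizes are all assumptions chosen for illustration. The sketch assigns each token to one group, runs ordinary dense attention only among tokens that share a group, and counts how many query-key pairs are evaluated compared with full attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(x, router_weight):
    # Hypothetical router (an assumption, not from the paper):
    # a linear projection to per-group logits followed by argmax.
    logits = x @ router_weight            # shape (N, G)
    return logits.argmax(axis=-1)         # shape (N,), one group id per token

def grouped_attention(q, k, v, group_ids, num_groups):
    # Dense attention restricted to tokens that share a group.
    out = np.zeros_like(v)
    scale = 1.0 / np.sqrt(q.shape[-1])
    for g in range(num_groups):
        idx = np.nonzero(group_ids == g)[0]
        if idx.size == 0:
            continue                      # no tokens were routed to this group
        scores = (q[idx] @ k[idx].T) * scale
        out[idx] = softmax(scores) @ v[idx]
    return out

def attention_pairs(group_ids, num_groups):
    # Number of query-key pairs evaluated: sum of squared group sizes.
    counts = np.bincount(group_ids, minlength=num_groups).astype(np.int64)
    return int((counts ** 2).sum())

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, D, G = 64, 16, 4
x = rng.standard_normal((N, D))
router_weight = rng.standard_normal((D, G))
q, k, v = (x @ rng.standard_normal((D, D)) for _ in range(3))

group_ids = route_tokens(x, router_weight)
out = grouped_attention(q, k, v, group_ids, G)

full_pairs = N * N
sparse_pairs = attention_pairs(group_ids, G)
print(f"pairs evaluated: {sparse_pairs} of {full_pairs}")
```

Because each group is handled with plain dense attention, the per-group computation could in principle be handed to an existing dense kernel, which is consistent with the abstract's description of the method as kernel-free. With perfectly balanced groups the pair count would fall from N² to roughly N²/G; the argmax router in this sketch does not enforce balance.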

Mixture of Contexts for Long Video Generation

Graphics

Makes videos remember stories for minutes.

28 Aug 2025 0

89%

Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

Computation and Language

Makes AI remember more without using more computer power.

16 Jun 2025 2

89%

Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

CV and Pattern Recognition

Makes long videos create faster without losing quality.

18 Aug 2025 0

Page Count

15 pages
