VideoNSA: Native Sparse Attention Scales Video Understanding
By: Enxin Song, Wenhao Chai, Shusheng Yang, and more
Potential Business Impact:
Lets computers watch and understand long videos better.
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We take a hardware-aware hybrid approach to attention, preserving dense attention for text while applying NSA to video. Compared with token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
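The sketch below illustrates the hybrid routing described in the abstract: text tokens go through dense attention, while video tokens go through an NSA-style mix of a global (block-compressed) branch and a local (sliding-window) branch. This is a minimal illustration under stated assumptions, not the released VideoNSA implementation: the gating weights, block and window sizes, the omitted token-selection branch, and the omitted causal masking are all simplifications for brevity.

```python
# Minimal sketch (assumptions, not the VideoNSA codebase): dense attention for
# text tokens, an NSA-style sparse mix of global (compressed) and local
# (sliding-window) branches for video tokens.
import torch
import torch.nn.functional as F


def dense_attention(q, k, v):
    # Standard scaled dot-product attention over the full sequence (text path).
    return F.scaled_dot_product_attention(q, k, v)


def nsa_style_sparse_attention(q, k, v, block=16, window=64):
    # Global branch: attend to block-compressed (mean-pooled) keys/values.
    B, H, T, D = k.shape
    pad = (-T) % block
    k_pad = F.pad(k, (0, 0, 0, pad))
    v_pad = F.pad(v, (0, 0, 0, pad))
    k_cmp = k_pad.view(B, H, -1, block, D).mean(dim=3)
    v_cmp = v_pad.view(B, H, -1, block, D).mean(dim=3)
    global_out = F.scaled_dot_product_attention(q, k_cmp, v_cmp)

    # Local branch: sliding-window attention via a banded boolean mask.
    idx = torch.arange(T, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window
    local_out = F.scaled_dot_product_attention(q, k, v, attn_mask=band)

    # Placeholder for the learnable gate: a fixed 50/50 mix of the global and
    # local branches (the paper studies how this global-local budget is split).
    # NSA's token-selection branch and causal masking are omitted for brevity.
    return 0.5 * global_out + 0.5 * local_out


def hybrid_attention(q, k, v, is_video):
    # Route video tokens through sparse attention and text tokens through
    # dense attention, mirroring the hybrid scheme in the abstract.
    if is_video:
        return nsa_style_sparse_attention(q, k, v)
    return dense_attention(q, k, v)


if __name__ == "__main__":
    B, H, T, D = 1, 4, 256, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    print(hybrid_attention(q, k, v, is_video=True).shape)   # torch.Size([1, 4, 256, 64])
    print(hybrid_attention(q, k, v, is_video=False).shape)  # torch.Size([1, 4, 256, 64])
```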
Similar Papers
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
Computation and Language
Makes computers understand long stories better.
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Computation and Language
Makes AI understand long stories better, faster.
TabNSA: Native Sparse Attention for Efficient Tabular Data Learning
Machine Learning (CS)
Helps computers learn from messy data faster.