Efficient Context Scaling with LongCat ZigZag Attention
By: Chen Zhang, Yang Bai, Jiahuan Li, and more
We introduce LongCat ZigZag Attention (LoZA), a sparse attention scheme designed to convert existing full-attention models into sparse variants at a limited compute budget. In long-context scenarios, LoZA achieves significant speed-ups in both prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) workloads. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.
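The abstract does not specify LoZA's actual sparsity pattern, so the sketch below is only illustrative: it assumes a generic "attention sink plus sliding window" pattern (the `window` and `num_sink` parameters are hypothetical) to show how masking out most query/key pairs is what enables prefill and decode speed-ups in sparse attention schemes of this kind. A production kernel would skip the masked blocks entirely rather than materialize a dense mask as done here.

```python
# Illustrative only: a generic block-sparse causal attention pattern.
# This is NOT the LoZA pattern; it just demonstrates the family of
# sparsity masks that long-context sparse attention methods rely on.
import torch


def sparse_causal_mask(seq_len: int, window: int = 256, num_sink: int = 16) -> torch.Tensor:
    """Boolean mask: True where a query position may attend to a key position."""
    q_idx = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    k_idx = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    causal = k_idx <= q_idx                      # never attend to future tokens
    local = (q_idx - k_idx) < window             # recent-token sliding window
    sink = k_idx < num_sink                      # always-visible prefix ("sink") tokens
    return causal & (local | sink)


def sparse_attention(q, k, v, window: int = 256, num_sink: int = 16):
    """q, k, v: (batch, heads, seq_len, head_dim). Dense reference implementation."""
    seq_len = q.shape[-2]
    mask = sparse_causal_mask(seq_len, window, num_sink).to(q.device)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Example usage
q = k = v = torch.randn(1, 8, 1024, 64)
out = sparse_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

Because each query attends to at most `window + num_sink` keys, the attention cost grows roughly linearly with sequence length instead of quadratically, which is why such schemes pay off most at the million-token scale described above.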