ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads
By: Zhuorui Liu, Chen Zhang, Dawei Song
Potential Business Impact:
Makes AI understand long stories faster.
With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities of LLMs. This long-context ability comes with deployment difficulties, especially due to the increased memory consumption of the KV cache. Prior work has aimed to optimize the memory footprint of the KV cache, inspired by the observation that attention heads can be categorized into retrieval heads, which are of great significance, and streaming heads, which are of less significance. Typically, identifying the streaming heads and waiving their KV cache largely reduces the overhead without noticeably hurting performance. However, employing both retrieval and streaming heads in one layer decomposes a single large attention computation into two smaller ones, which can unexpectedly add latency for accessing and indexing tensors. Based on this intuition, we introduce an important improvement to the identification process of retrieval and streaming heads: a criterion that enforces each layer to contain exclusively retrieval heads or exclusively streaming heads. In this way, we eliminate the extra latency while incurring only negligible performance degradation. Our method, named ZigzagAttention, is competitive with the considered baselines owing to its reduced latency and comparable performance.
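To make the layer-exclusive idea concrete, below is a minimal Python sketch, not the paper's actual criterion or implementation. It assumes per-head "retrieval scores" have already been measured by some probing procedure, ranks layers by their mean head score, and keeps a fixed budget of whole layers as retrieval layers (full KV cache) while the remaining layers are treated as streaming layers (reduced KV cache), so no layer mixes head types and each layer's attention stays a single fused computation.

```python
# Minimal sketch of layer-exclusive head assignment (an assumption, not the
# paper's exact criterion). Per-head retrieval scores are assumed given,
# e.g. from a needle-in-a-haystack style probe.

from typing import Dict, List


def assign_layers(
    head_scores: Dict[int, List[float]],  # layer index -> retrieval score per head
    retrieval_layer_budget: int,          # number of layers that keep the full KV cache
) -> Dict[int, str]:
    """Label each layer as 'retrieval' (full KV cache) or 'streaming'
    (e.g. sink tokens + recent window only), with no mixed layers."""
    # Aggregate head scores per layer; the mean is one simple choice.
    layer_score = {
        layer: sum(scores) / len(scores) for layer, scores in head_scores.items()
    }
    # Keep the highest-scoring layers as retrieval layers; the rest stream.
    ranked = sorted(layer_score, key=layer_score.get, reverse=True)
    retrieval_layers = set(ranked[:retrieval_layer_budget])
    return {
        layer: "retrieval" if layer in retrieval_layers else "streaming"
        for layer in head_scores
    }


if __name__ == "__main__":
    # Toy example: 4 layers x 2 heads with made-up scores.
    scores = {0: [0.9, 0.8], 1: [0.1, 0.2], 2: [0.7, 0.6], 3: [0.05, 0.1]}
    print(assign_layers(scores, retrieval_layer_budget=2))
    # -> {0: 'retrieval', 1: 'streaming', 2: 'retrieval', 3: 'streaming'}
```

Because every head in a layer shares one cache policy under this assignment, each layer runs one attention kernel over one KV cache layout, which is the source of the latency savings the abstract describes.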
Similar Papers
Retrospective Sparse Attention for Efficient Long-Context Generation
Computation and Language
Fixes AI mistakes in long stories.
Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking
Computation and Language
Helps computers find important info in long texts.
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
Computation and Language
Makes AI understand long texts faster and cheaper.