LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

Published: April 7, 2025 | arXiv ID: 2504.04704v2

By: Manlai Liang, JiaMing Zhang, Xiong Li, and more

Potential Business Impact:

Lets large language models handle much longer inputs while keeping inference fast and memory costs low.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The growing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leverage attention weights to evict non-critical cache tokens. However, those methods involve a trade-off: they usually require major modifications of the inference infrastructure and incur significant computation overhead. Building on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the KV entries themselves. It is an entirely attention-free method that integrates easily into mainstream inference platforms and performs comparably to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
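The abstract only describes LagKV at a high level: tokens are scored by comparing KV entries against a later ("lag") portion of the cache, with no attention weights involved. The sketch below is one plausible reading of that idea, not the paper's exact formulation; the chunk size, the min-max normalization against the following chunk, the std-based score, and all function names are illustrative assumptions.

```python
import torch

def lag_relative_scores(keys, values, chunk_size=128):
    """Hypothetical attention-free importance score.

    keys, values: [num_heads, seq_len, head_dim]
    Each chunk of tokens is scored against the min/max statistics of the
    *next* chunk (the "lag" reference), so no attention weights are needed.
    """
    H, T, D = keys.shape
    n_chunks = T // chunk_size
    scores = torch.zeros(H, n_chunks * chunk_size)
    for c in range(n_chunks - 1):
        cur = slice(c * chunk_size, (c + 1) * chunk_size)
        lag = slice((c + 1) * chunk_size, (c + 2) * chunk_size)
        s = 0.0
        for x in (keys, values):
            # Normalize the current chunk by the lag chunk's per-dim range (assumed rule).
            ref_min = x[:, lag, :].amin(dim=1, keepdim=True)
            ref_max = x[:, lag, :].amax(dim=1, keepdim=True)
            norm = (x[:, cur, :] - ref_min) / (ref_max - ref_min + 1e-6)
            # Spread across head dims as a proxy for how "informative" a token is.
            s = s + norm.std(dim=-1)
        scores[:, cur] = torch.softmax(s, dim=-1)
    # The last chunk has no lag reference; keep these recent tokens unconditionally.
    scores[:, (n_chunks - 1) * chunk_size:] = 1.0
    return scores

def compress_kv(keys, values, keep_ratio=0.5, chunk_size=128):
    """Keep the top `keep_ratio` tokens per head by lag-relative score, order preserved."""
    scores = lag_relative_scores(keys, values, chunk_size)
    T = scores.shape[1]
    k = max(1, int(keep_ratio * T))
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
    gather = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys[:, :T].gather(1, gather), values[:, :T].gather(1, gather)
```

Because such a score uses only per-chunk tensor statistics on the cached keys and values, it adds a few elementwise operations per chunk and never touches the attention kernel, which is what would make an approach like this easy to drop into existing serving stacks.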

Page Count
14 pages

Category
Computer Science:
Machine Learning (CS)