LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
By: Manlai Liang, JiaMing Zhang, Xiong Li, and more
Potential Business Impact:
Makes AI remember more without slowing down.
The growing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts rely on attention weights to evict non-critical cache tokens. These methods come with a trade-off: they typically require major modification of the inference infrastructure and significant computational overhead. Building on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the KV entries themselves. It is an entirely attention-free method that integrates easily into mainstream inference platforms and performs comparably to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM across different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
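The abstract only states that tokens are scored by comparing KV entries against each other rather than by attention weights. The sketch below is one illustrative reading of that idea, not the authors' published algorithm: the chunking into lag-sized windows, the min-max normalization against the trailing (lagging) chunk, the standard-deviation score, the default `lag=128`, and the helper names `lag_relative_scores` and `evict` are all assumptions made here for concreteness.

```python
# Illustrative sketch only: attention-free KV-cache token scoring that compares
# each chunk of cached keys/values against statistics of the chunk lagging
# right behind it. Chunk size, normalization, and the scoring rule are
# assumptions for illustration, not necessarily the paper's exact procedure.
import torch


def lag_relative_scores(keys: torch.Tensor,
                        values: torch.Tensor,
                        lag: int = 128) -> torch.Tensor:
    """keys/values: [num_heads, seq_len, head_dim] for one layer.
    Returns per-token scores for the first (seq_len // lag - 1) * lag tokens;
    the most recent tokens have no complete lagging reference and are kept."""
    def score(x: torch.Tensor) -> torch.Tensor:
        _, seq_len, _ = x.shape
        n_chunks = seq_len // lag - 1      # chunks that have a full reference
        chunk_scores = []
        for c in range(n_chunks):
            cur = x[:, c * lag:(c + 1) * lag]        # chunk being scored
            ref = x[:, (c + 1) * lag:(c + 2) * lag]  # lagging reference chunk
            ref_min = ref.min(dim=1, keepdim=True).values
            ref_max = ref.max(dim=1, keepdim=True).values
            # Normalize the current chunk by the reference chunk's value range,
            # then treat tokens whose states spread out most as informative
            # (assumed score).
            normed = (cur - ref_min) / (ref_max - ref_min + 1e-6)
            chunk_scores.append(normed.std(dim=-1))  # [num_heads, lag]
        return torch.cat(chunk_scores, dim=1)

    return score(keys) + score(values)


def evict(keys: torch.Tensor, values: torch.Tensor,
          keep_ratio: float = 0.5, lag: int = 128):
    """Keep the top-scoring tokens plus the most recent tokens, per head."""
    num_heads, seq_len, head_dim = keys.shape
    scored_len = (seq_len // lag - 1) * lag
    if scored_len <= 0:                    # sequence too short to compress
        return keys, values
    scores = lag_relative_scores(keys, values, lag)  # [H, scored_len]
    num_keep = int(keep_ratio * scored_len)
    top = scores.topk(num_keep, dim=1).indices.sort(dim=1).values
    recent = torch.arange(scored_len, seq_len,
                          device=keys.device).expand(num_heads, -1)
    keep = torch.cat([top, recent], dim=1)           # [H, num_keep + rest]
    gather = keep.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, gather), values.gather(1, gather)
```

Because such a score depends only on the cached key/value tensors, it can be computed once after prefill without touching the attention kernels, which is the kind of drop-in integration with mainstream inference stacks the abstract refers to.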
Similar Papers
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
Machine Learning (CS)
Makes AI remember more without slowing down.
CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
Computation and Language
Makes AI remember more without slowing down.
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Computation and Language
Makes AI understand long texts faster, using less memory.