CaliDrop: KV Cache Compression with Calibration
By: Yi Su, Quantong Qiu, Yuechi Zhou, and more
Potential Business Impact:
Saves computer memory for faster AI.
Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, and they often complement each other. This paper focuses on enhancing token eviction strategies. Token eviction exploits the observation that attention patterns are often sparse, allowing less critical KV entries to be removed to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly at high compression ratios. To address this issue, we propose CaliDrop, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
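The abstract describes two pieces: score-based eviction of low-attention KV entries, and a calibration step that exploits the similarity of nearby queries to reuse the evicted tokens' attention contribution. The sketch below illustrates that idea in PyTorch. The function names (evict_kv, evicted_contribution, calibrated_attention), the accumulated-attention eviction criterion, and the caching scheme are assumptions for illustration only, not the paper's actual implementation.

```python
# A minimal sketch of eviction plus calibration, assuming PyTorch.
import torch

def evict_kv(keys, values, attn_mass, keep: int):
    """Score-based token eviction: retain the `keep` cached entries that
    have received the most accumulated attention, evict the rest."""
    keep_idx = torch.topk(attn_mass, k=keep).indices
    mask = torch.zeros_like(attn_mass, dtype=torch.bool)
    mask[keep_idx] = True
    return keys[mask], values[mask], keys[~mask], values[~mask]

def evicted_contribution(query, evicted_k, evicted_v):
    """Unnormalized softmax mass and value-weighted sum contributed by the
    evicted tokens for `query` (max-subtraction for numerical stability is
    omitted for brevity)."""
    scale = query.shape[-1] ** -0.5
    w = torch.exp(evicted_k @ query * scale)
    return w.sum(), w @ evicted_v

def calibrated_attention(query, kept_k, kept_v, cached_mass, cached_sum):
    """Attention over the kept tokens, corrected by a cached contribution
    from the evicted tokens. Because queries at nearby positions are highly
    similar, a contribution computed for an earlier query can be reused
    speculatively for its neighbors."""
    scale = query.shape[-1] ** -0.5
    w = torch.exp(kept_k @ query * scale)
    return (w @ kept_v + cached_sum) / (w.sum() + cached_mass)

# Toy usage: evict half of 16 cached tokens, compute the evicted-token
# contribution once for q0, then reuse it for a nearby query q1.
torch.manual_seed(0)
keys, values = torch.randn(16, 64), torch.randn(16, 64)
attn_mass = torch.rand(16)
kept_k, kept_v, ev_k, ev_v = evict_kv(keys, values, attn_mass, keep=8)
q0 = torch.randn(64)
q1 = q0 + 0.01 * torch.randn(64)  # nearby queries look nearly identical
mass, vsum = evicted_contribution(q0, ev_k, ev_v)
out = calibrated_attention(q1, kept_k, kept_v, mass, vsum)
print(out.shape)  # torch.Size([64])
```

The correction is exact when the cached contribution was computed for the same query, since softmax attention decomposes into kept and evicted terms sharing one normalizer; the abstract's query-similarity observation is what makes reusing a nearby query's contribution a reasonable approximation.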
Similar Papers
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
Machine Learning (CS)
Makes AI remember more without slowing down.
LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Machine Learning (CS)
Makes AI remember more without slowing down.