SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
By: Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, and more
Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, since the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Toward reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present \textbf{SkipKV}, a \textbf{\textit{training-free}} KV compression method for selective \textit{eviction} and \textit{generation} that operates via coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a \textit{sentence-scoring metric} to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which maintains up to $\mathbf{26.7}\%$ higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA methods, SkipKV yields up to $\mathbf{1.6}\times$ shorter generation lengths while improving throughput by up to $\mathbf{1.7}\times$.
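As a rough illustration of the two mechanisms the abstract describes (sentence-level similarity scoring for coarse-grained KV eviction, and a steering vector added to hidden activations to encourage concise generation), the PyTorch sketch below shows one way such components could be wired up. The function names, the cosine-similarity scoring rule, the eviction threshold, and the scaling factor `alpha` are assumptions made for exposition only; the paper's actual sentence-scoring metric and steering-vector update may differ.

```python
# Illustrative sketch, not SkipKV's actual implementation.
import torch
import torch.nn.functional as F


def sentence_similarity_scores(sentence_embs: torch.Tensor) -> torch.Tensor:
    """Score each sentence by its maximum cosine similarity to any earlier sentence.

    sentence_embs: (num_sentences, hidden_dim) pooled hidden states,
    one row per generated CoT sentence.
    """
    normed = F.normalize(sentence_embs, dim=-1)
    sim = normed @ normed.T                    # pairwise cosine similarity
    sim = torch.tril(sim, diagonal=-1)         # only compare against earlier sentences
    return sim.max(dim=-1).values              # high score -> highly redundant sentence


def select_sentences_to_evict(scores: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Return indices of sentences whose KV entries could be dropped wholesale."""
    return torch.nonzero(scores > threshold, as_tuple=False).flatten()


def apply_steering(hidden: torch.Tensor, steer_vec: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add a scaled steering vector to the current hidden states,
    nudging generation toward more concise continuations."""
    return hidden + alpha * steer_vec


# Toy usage with random tensors standing in for real sentence embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    embs = torch.randn(6, 64)
    embs[4] = embs[1] + 0.01 * torch.randn(64)  # sentence 4 nearly repeats sentence 1
    scores = sentence_similarity_scores(embs)
    print("evict sentence indices:", select_sentences_to_evict(scores).tolist())

    hidden = torch.randn(1, 64)
    steer = torch.randn(64)
    hidden = apply_steering(hidden, steer, alpha=0.1)
```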