SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
By: Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, and more
Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, since the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Toward reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present \textbf{SkipKV}, a \textbf{\textit{training-free}} KV compression method for selective \textit{eviction} and \textit{generation} that operates via coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a \textit{sentence-scoring metric} to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which maintains up to $\mathbf{26.7}\%$ higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA methods, SkipKV yields up to $\mathbf{1.6}\times$ shorter generation lengths while improving throughput by up to $\mathbf{1.7}\times$.
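As a rough illustration of the two mechanisms the abstract describes (sentence-level similarity scoring for coarse-grained KV eviction, and a steering vector added to hidden activations to encourage concise generation), the PyTorch sketch below shows one way such components could be wired up. The function names, the cosine-similarity scoring rule, the eviction threshold, and the scaling factor `alpha` are assumptions made for exposition only; the paper's actual sentence-scoring metric and steering-vector update may differ.

```python
# Illustrative sketch, not SkipKV's actual implementation.
import torch
import torch.nn.functional as F


def sentence_similarity_scores(sentence_embs: torch.Tensor) -> torch.Tensor:
    """Score each sentence by its maximum cosine similarity to any earlier sentence.

    sentence_embs: (num_sentences, hidden_dim) pooled hidden states,
    one row per generated CoT sentence.
    """
    normed = F.normalize(sentence_embs, dim=-1)
    sim = normed @ normed.T                    # pairwise cosine similarity
    sim = torch.tril(sim, diagonal=-1)         # only compare against earlier sentences
    return sim.max(dim=-1).values              # high score -> highly redundant sentence


def select_sentences_to_evict(scores: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Return indices of sentences whose KV entries could be dropped wholesale."""
    return torch.nonzero(scores > threshold, as_tuple=False).flatten()


def apply_steering(hidden: torch.Tensor, steer_vec: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add a scaled steering vector to the current hidden states,
    nudging generation toward more concise continuations."""
    return hidden + alpha * steer_vec


# Toy usage with random tensors standing in for real sentence embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    embs = torch.randn(6, 64)
    embs[4] = embs[1] + 0.01 * torch.randn(64)  # sentence 4 nearly repeats sentence 1
    scores = sentence_similarity_scores(embs)
    print("evict sentence indices:", select_sentences_to_evict(scores).tolist())

    hidden = torch.randn(1, 64)
    steer = torch.randn(64)
    hidden = apply_steering(hidden, steer, alpha=0.1)
```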