Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
By: Joseph Kampeas, Emir Haleva
Potential Business Impact:
Makes AI chat faster by saving memory.
Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving the standard cache structure. This alleviates the KV-cache memory bottleneck and supports high-concurrency serving without specialized hardware. Theoretically, we analyze the rate-distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38× KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. In real LLM serving, joint encoding improves token throughput by ~40% on a single-machine vLLM benchmark. Code is available at https://github.com/sef1/kv_fast_fusion kv_joint_encoding.
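The abstract only sketches the mechanism, so here is a minimal, hypothetical illustration of the core idea: logical KV-cache blocks that are sufficiently similar (for example, across concurrent requests sharing a prompt prefix) are mapped through a block table to a single shared, fused physical block, so the paged-cache layout stays intact while memory shrinks. The function name fuse_kv_blocks, the cosine-similarity threshold, and the averaging rule below are illustrative assumptions, not the paper's actual joint-encoding scheme.

```python
# Minimal sketch of KV-cache block fusion (hypothetical API, not the paper's code).
import numpy as np

def fuse_kv_blocks(blocks: np.ndarray, sim_threshold: float = 0.98):
    """Greedily fuse similar KV-cache blocks.

    blocks: shape (num_blocks, block_size * head_dim); each row is a flattened KV block.
    sim_threshold: hypothetical tuning knob for "similar enough to share".
    Returns (physical_blocks, block_table), where block_table[i] is the index of the
    shared physical block serving logical block i.
    """
    physical = []      # representative (fused) blocks actually kept in memory
    block_table = []   # logical -> physical indirection, as in paged attention
    for b in blocks:
        b_norm = b / (np.linalg.norm(b) + 1e-8)
        match = -1
        for j, p in enumerate(physical):
            p_norm = p / (np.linalg.norm(p) + 1e-8)
            if float(b_norm @ p_norm) >= sim_threshold:
                match = j
                break
        if match < 0:
            physical.append(b.copy())
            block_table.append(len(physical) - 1)
        else:
            # Fuse by averaging; the paper's joint encoding may use a different rule.
            physical[match] = 0.5 * (physical[match] + b)
            block_table.append(match)
    return np.stack(physical), np.array(block_table)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal((4, 256))
    # Simulate near-duplicate blocks produced by concurrent requests.
    blocks = np.concatenate(
        [base + 0.01 * rng.standard_normal((4, 256)) for _ in range(8)]
    )
    fused, table = fuse_kv_blocks(blocks)
    print(f"logical blocks: {len(blocks)}, physical blocks after fusion: {len(fused)}")
```

In this toy run, 32 logical blocks collapse to 4 shared physical blocks; attention kernels would still index the cache through the block table, which is what "preserving standard cache structure" refers to in the abstract.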
Similar Papers
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Computation and Language
Makes AI remember more without using more memory.
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Machine Learning (CS)
Makes AI remember more without slowing down.
KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference
Machine Learning (CS)
Makes AI remember more without using much memory.