KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
By: Jiahao Wang, Jinbo Han, Xingda Wei, and more
Potential Business Impact:
Makes AI answer faster by reusing work from past requests.
Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuse between single-turn requests is as important as reuse within multi-turn requests; reuse time and probability vary widely across all requests, yet for a specific request category the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves serving performance under real-world traces, especially with limited cache capacity.
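The abstract does not describe the proposed eviction policy in detail. Below is a minimal sketch of what a workload-aware eviction rule could look like, assuming (as the abstract suggests) that each request category has a predictable reuse window that can be estimated from traces. The class name CategoryAwareCache, the reuse_window_s parameter, and the scoring rule are illustrative assumptions, not the paper's actual design.

```python
# Sketch (not the paper's implementation): score cached KV entries by how
# likely their request category is to be reused soon, instead of plain LRU.
import time
from dataclasses import dataclass, field


@dataclass
class Entry:
    key: str                 # e.g. hash of the token prefix this KV block covers
    category: str            # request category (multi-turn chat, single-turn, API, ...)
    size: int                # bytes occupied by the cached KV tensors
    last_access: float = field(default_factory=time.monotonic)


class CategoryAwareCache:
    def __init__(self, capacity_bytes: int, reuse_window_s: dict[str, float]):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries: dict[str, Entry] = {}
        # Per-category expected time until next reuse, estimated offline from
        # traces (hypothetical input; the abstract only says per-category
        # patterns tend to be predictable).
        self.reuse_window = reuse_window_s

    def _eviction_score(self, e: Entry) -> float:
        # Larger score means evict sooner: an entry whose expected reuse
        # window has already elapsed is unlikely to be hit again.
        age = time.monotonic() - e.last_access
        window = self.reuse_window.get(e.category, 60.0)
        return age / window

    def get(self, key: str) -> Entry | None:
        e = self.entries.get(key)
        if e:
            e.last_access = time.monotonic()
        return e

    def put(self, key: str, category: str, size: int) -> None:
        # Evict highest-score entries until the new entry fits.
        while self.used + size > self.capacity and self.entries:
            victim = max(self.entries.values(), key=self._eviction_score)
            self.used -= victim.size
            del self.entries[victim.key]
        self.entries[key] = Entry(key, category, size)
        self.used += size
```

Under this sketch, an entry from a category with a long expected reuse window (e.g., a multi-turn conversation awaiting the user's next message) is kept longer than an equally old entry whose window has already passed, which is one way the per-category predictability noted in the abstract could be exploited.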
Similar Papers
Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
Networking and Internet Architecture
Saves computer time and money by reusing text.
KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse
Computation and Language
Makes AI understand long texts much faster.
KV Cache Compression for Inference Efficiency in LLMs: A Review
Distributed, Parallel, and Cluster Computing
Makes AI smarter and faster using less memory.