Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
By: Hanchen Li, Yuhan Liu, Yihua Cheng, and more
Potential Business Impact:
Saves compute time and money by reusing work done on repeated text.
Across large language model (LLM) applications, we observe an emerging trend of reusing KV caches to save the prefill delay of processing repeated input texts across different LLM inputs. This has opened a broad design space, ranging from colocating stored KV caches with (or close to) GPUs to various KV cache compression schemes. However, a key question remains unanswered: can these delay reductions also be economically favorable? Specifically, we ask whether a developer can use public cloud services to store precomputed KV caches and reuse them to save delay without incurring additional costs in compute, storage, and network. To answer this question, we propose a validated analytical model for the cloud cost (in compute, storage, and network) of storing and reusing KV caches, based on workload parameters such as reuse frequency, generated text lengths, and model sizes. Preliminary results show that KV cache reuse can save both delay and cloud cost across a range of workloads with long contexts, and we call for more effort on building more economical context-augmented LLM generation through KV cache reuse.
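To make the cost trade-off concrete, the sketch below totals the GPU cost of recomputing a long prefix's prefill on every request versus the storage, transfer, and load-time cost of keeping its KV cache in cloud storage. This is a minimal illustration, not the paper's validated model; every function name, price, and workload number is an assumption chosen for the example.

```python
"""
Back-of-the-envelope comparison: recompute prefill on every reuse vs. store
the KV cache once and load it on each reuse. All prices and workload numbers
are illustrative assumptions, not figures from the paper.
"""

def kv_cache_size_gb(context_tokens, num_layers, hidden_dim, bytes_per_elem=2):
    # Each token stores one K and one V vector of size hidden_dim per layer (fp16).
    return context_tokens * num_layers * 2 * hidden_dim * bytes_per_elem / 1e9

def recompute_cost(reuses, prefill_seconds, gpu_price_per_hour):
    # GPU time billed for re-running prefill on every request with this prefix.
    return reuses * prefill_seconds * gpu_price_per_hour / 3600

def reuse_cost(reuses, cache_gb, storage_price_gb_month, retention_months,
               network_price_per_gb, load_seconds, gpu_price_per_hour):
    storage = cache_gb * storage_price_gb_month * retention_months
    # Intra-region transfer is often free; cross-region egress usually is not.
    network = reuses * cache_gb * network_price_per_gb
    # The GPU is still billed while the cache streams in before decoding starts.
    gpu_wait = reuses * load_seconds * gpu_price_per_hour / 3600
    return storage + network + gpu_wait

if __name__ == "__main__":
    # Hypothetical workload: a 32k-token context with a 7B-scale model
    # (32 layers, 4096 hidden dim), reused 200 times over one month.
    cache_gb = kv_cache_size_gb(32_000, num_layers=32, hidden_dim=4096)
    print(f"KV cache size: {cache_gb:.1f} GB")
    print(f"recompute every time: ${recompute_cost(200, 8, 2.0):.2f}")
    print(f"store and reuse:      "
          f"${reuse_cost(200, cache_gb, 0.02, 1, 0.0, 2, 2.0):.2f}")
```

Which side wins depends heavily on reuse frequency, context length, and where the stored cache sits relative to the GPU; quantifying exactly that trade-off is what the paper's analytical model is for.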
Similar Papers
KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse
Computation and Language
Makes AI understand long texts much faster.
KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
Distributed, Parallel, and Cluster Computing
Makes AI answer faster by remembering past answers.
KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs
Machine Learning (CS)
Reuses old computer thoughts to make new ones faster.