Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
By: Jungwoo Kim, Minsang Kim, Jaeheon Lee, and others
Potential Business Impact:
Makes AI answer questions faster and cheaper.
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. However, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Motivated by this gap, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71× higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
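To make the abstract's three ideas concrete, here is a minimal Python sketch of a semantic cache that groups similar queries under embedding centroids, answers lookups by cosine similarity against a threshold, and evicts by recency. The class name, parameters, and eviction rule are illustrative assumptions, not SISO's actual design; in particular, the threshold here is fixed, whereas SISO adjusts it dynamically with the workload.

```python
# Hypothetical sketch of centroid-based semantic caching with a similarity
# threshold and recency-based eviction. Illustrative only; not SISO's code.
import numpy as np

class CentroidSemanticCache:
    def __init__(self, threshold=0.85, capacity=128):
        self.threshold = threshold   # cosine-similarity cutoff for a cache hit
        self.capacity = capacity     # maximum number of cached centroids
        self.centroids = []          # one embedding centroid per cached entry
        self.counts = []             # how many queries each centroid has absorbed
        self.responses = []          # cached LLM response per centroid
        self.last_used = []          # logical timestamps for locality-aware eviction
        self.clock = 0

    def _cosine(self, a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def lookup(self, query_emb):
        """Return a cached response if some centroid is similar enough, else None."""
        self.clock += 1
        best, best_sim = None, -1.0
        for i, c in enumerate(self.centroids):
            sim = self._cosine(query_emb, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.last_used[best] = self.clock
            return self.responses[best]
        return None

    def insert(self, query_emb, response):
        """Merge into the nearest centroid if close enough, else add a new entry."""
        self.clock += 1
        query_emb = np.asarray(query_emb, dtype=float)
        for i, c in enumerate(self.centroids):
            if self._cosine(query_emb, c) >= self.threshold:
                # Running-mean update: one centroid covers many similar queries,
                # which keeps memory usage low while preserving coverage.
                n = self.counts[i]
                self.centroids[i] = (c * n + query_emb) / (n + 1)
                self.counts[i] = n + 1
                self.last_used[i] = self.clock
                return
        if len(self.centroids) >= self.capacity:
            # Evict the least recently used centroid (a simple stand-in for
            # SISO's locality-aware replacement policy).
            victim = int(np.argmin(self.last_used))
            for lst in (self.centroids, self.counts, self.responses, self.last_used):
                del lst[victim]
        self.centroids.append(query_emb)
        self.counts.append(1)
        self.responses.append(response)
        self.last_used.append(self.clock)
```

In this sketch, semantically similar queries hit the same centroid instead of each occupying its own slot, which is the intuition behind the paper's claim of maximizing coverage with minimal memory.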
Similar Papers
Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Machine Learning (CS)
Smartly reuses AI answers to save time.
SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.