Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
By: Peng Chen, Jiaji Zhang, Hailiang Zhao, and more
Potential Business Impact:
Makes AI models run much faster and smoother.
In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
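The core idea described in the abstract, an LRU cache augmented with learned reuse predictions and an online error estimate that decides how much to trust them, can be sketched roughly as below. This is an illustrative approximation only, not the paper's LARU algorithm: the class name PredictionAugmentedLRU, the predictor callable, the error_threshold parameter, and the error normalization are all assumptions made for the sake of a self-contained example.

# Illustrative sketch (not the paper's LARU): an LRU cache augmented with a
# predictor of each key's next-access time and an online error estimate that
# controls whether eviction trusts the predictions or falls back to plain LRU.
from collections import OrderedDict

class PredictionAugmentedLRU:
    def __init__(self, capacity, predictor, error_threshold=0.3):
        self.capacity = capacity
        self.predictor = predictor          # callable: key -> predicted next-access time (assumed interface)
        self.error_threshold = error_threshold
        self.cache = OrderedDict()          # key -> value, ordered by recency
        self.predicted = {}                 # key -> predicted next-access time
        self.clock = 0                      # logical time, incremented per access
        self.err_sum = 0.0                  # accumulated |predicted - actual| error
        self.err_count = 0

    def _error_rate(self):
        # Normalized online error estimate; a high value means predictions
        # have been unreliable and the policy should behave like plain LRU.
        if self.err_count == 0:
            return 0.0
        return self.err_sum / (self.err_count * max(self.capacity, 1))

    def get(self, key):
        self.clock += 1
        if key in self.cache:
            # Compare the realized access time against the earlier prediction.
            if key in self.predicted:
                self.err_sum += abs(self.predicted[key] - self.clock)
                self.err_count += 1
            self.cache.move_to_end(key)     # refresh recency (LRU bookkeeping)
            self.predicted[key] = self.predictor(key)
            return self.cache[key]
        return None                         # cache miss

    def put(self, key, value):
        self.clock += 1
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self._evict()
        self.cache[key] = value
        self.predicted[key] = self.predictor(key)

    def _evict(self):
        if self._error_rate() > self.error_threshold:
            # Predictions look inaccurate: degrade gracefully to plain LRU.
            victim, _ = self.cache.popitem(last=False)
        else:
            # Predictions look accurate: evict the item whose predicted next
            # access is farthest in the future (a Belady-style choice).
            victim = max(self.cache, key=lambda k: self.predicted.get(k, 0))
            del self.cache[victim]
        self.predicted.pop(victim, None)

The switch between the prediction-driven victim and the LRU victim is what gives the graceful degradation the abstract claims: with perfect predictions the eviction choice approaches Belady's optimal policy, and with poor predictions the observed error pushes the policy back toward ordinary LRU behavior.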
Similar Papers
Inferring Causal Relationships to Improve Caching for Clients with Correlated Requests: Applications to VR
Networking and Internet Architecture
Makes VR games load faster by predicting what you'll need.
LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
Hardware Architecture
Makes AI models run much faster on computers.
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Hardware Architecture
Makes AI answer questions much faster.