Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
By: Peng Chen, Jiaji Zhang, Hailiang Zhao, and more
Potential Business Impact:
Makes AI models run much faster and smoother.
In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
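The core idea described in the abstract, an LRU cache augmented with learned reuse predictions and an online error estimate that decides how much to trust them, can be sketched roughly as below. This is an illustrative approximation only, not the paper's LARU algorithm: the class name PredictionAugmentedLRU, the predictor callable, the error_threshold parameter, and the error normalization are all assumptions made for the sake of a self-contained example.

# Illustrative sketch (not the paper's LARU): an LRU cache augmented with a
# predictor of each key's next-access time and an online error estimate that
# controls whether eviction trusts the predictions or falls back to plain LRU.
from collections import OrderedDict

class PredictionAugmentedLRU:
    def __init__(self, capacity, predictor, error_threshold=0.3):
        self.capacity = capacity
        self.predictor = predictor          # callable: key -> predicted next-access time (assumed interface)
        self.error_threshold = error_threshold
        self.cache = OrderedDict()          # key -> value, ordered by recency
        self.predicted = {}                 # key -> predicted next-access time
        self.clock = 0                      # logical time, incremented per access
        self.err_sum = 0.0                  # accumulated |predicted - actual| error
        self.err_count = 0

    def _error_rate(self):
        # Normalized online error estimate; a high value means predictions
        # have been unreliable and the policy should behave like plain LRU.
        if self.err_count == 0:
            return 0.0
        return self.err_sum / (self.err_count * max(self.capacity, 1))

    def get(self, key):
        self.clock += 1
        if key in self.cache:
            # Compare the realized access time against the earlier prediction.
            if key in self.predicted:
                self.err_sum += abs(self.predicted[key] - self.clock)
                self.err_count += 1
            self.cache.move_to_end(key)     # refresh recency (LRU bookkeeping)
            self.predicted[key] = self.predictor(key)
            return self.cache[key]
        return None                         # cache miss

    def put(self, key, value):
        self.clock += 1
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self._evict()
        self.cache[key] = value
        self.predicted[key] = self.predictor(key)

    def _evict(self):
        if self._error_rate() > self.error_threshold:
            # Predictions look inaccurate: degrade gracefully to plain LRU.
            victim, _ = self.cache.popitem(last=False)
        else:
            # Predictions look accurate: evict the item whose predicted next
            # access is farthest in the future (a Belady-style choice).
            victim = max(self.cache, key=lambda k: self.predicted.get(k, 0))
            del self.cache[victim]
        self.predicted.pop(victim, None)

The switch between the prediction-driven victim and the LRU victim is what gives the graceful degradation the abstract claims: with perfect predictions the eviction choice approaches Belady's optimal policy, and with poor predictions the observed error pushes the policy back toward ordinary LRU behavior.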
Similar Papers
Inferring Causal Relationships to Improve Caching for Clients with Correlated Requests: Applications to VR
Networking and Internet Architecture
Makes VR games load faster by predicting what you'll need.
LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
Hardware Architecture
Makes AI models run much faster on computers.
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Hardware Architecture
Makes AI answer questions much faster.