LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
By: Zhongchun Zhou, Chengtao Lai, Wei Zhang
Potential Business Impact:
Makes AI models run much faster on computers.
Large Language Models (LLMs) have achieved unprecedented success across a wide range of applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimizing the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load-balance-aware cache arbitration with thread throttling to meet stringent bandwidth demands and minimize cache stalls during KV cache access. We also propose a hybrid simulation framework that integrates analytical models with cycle-level simulators via memory traces, balancing architectural detail against simulation efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is bottlenecked mainly by miss-handling throughput, whereas baselines mostly show negative improvements because they are not optimized for this scenario. When cache capacity is also limited, our policy achieves a 1.58x speedup over the unoptimized version and a 1.26x improvement over the best baseline (dyncta). Overall, LLaMCAT is the first work to target LLM decoding-specific MSHR contention, a gap left by previous work, and it presents a practical solution for accelerating LLM inference on future hardware platforms.
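To make the arbitration-plus-throttling idea concrete, here is a minimal illustrative sketch. It is not the paper's implementation: the class names (LLCPartition, MSHRAwareArbiter), the occupancy-based throttle threshold, and the lowest-occupancy-first ordering are all assumptions chosen only to show how MSHR availability and load balance could jointly drive an LLC arbitration and throttling decision.

```python
# Illustrative sketch (assumed design, not LLaMCAT's actual policy):
# an LLC arbiter that serves requests only when a partition has a free MSHR,
# favors less-loaded partitions, and throttles request issue when
# miss-handling resources are near saturation.

from collections import deque

class LLCPartition:
    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.mshrs_in_use = 0
        self.queue = deque()  # pending KV-cache requests for this partition

    def mshr_free(self):
        return self.mshrs_in_use < self.num_mshrs

    def occupancy(self):
        return self.mshrs_in_use / self.num_mshrs


class MSHRAwareArbiter:
    """Each cycle, picks which pending requests to service, steering work
    toward partitions with spare MSHRs (load balance) and signaling
    thread throttling when MSHRs are nearly exhausted."""

    def __init__(self, partitions, throttle_threshold=0.9):  # threshold is a made-up knob
        self.partitions = partitions
        self.throttle_threshold = throttle_threshold

    def should_throttle(self):
        # Stop threads from issuing new cache requests once average MSHR
        # occupancy across partitions exceeds the threshold.
        avg_occ = sum(p.occupancy() for p in self.partitions) / len(self.partitions)
        return avg_occ > self.throttle_threshold

    def arbitrate(self):
        # Serve at most one request per partition per cycle, visiting
        # less-occupied partitions first; skip partitions with no free MSHR.
        served = []
        for p in sorted(self.partitions, key=lambda x: x.occupancy()):
            if p.queue and p.mshr_free():
                req = p.queue.popleft()
                p.mshrs_in_use += 1  # MSHR released later when the miss returns
                served.append(req)
        return served
```

The intuition matches the abstract's claim: once MSHRs fill up, additional misses only stall the pipeline, so throttling issuing threads and steering requests toward less-loaded partitions keeps miss-handling throughput from collapsing.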
Similar Papers
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
Hardware Architecture
Makes AI remember more by using faster memory.
DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management
Hardware Architecture
Makes AI faster by sharing computer memory.