LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
By: Zhongchun Zhou, Chengtao Lai, Wei Zhang
Potential Business Impact:
Makes AI models run much faster on computers.
Large Language Models (LLMs) have achieved unprecedented success across a wide range of applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimizing the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load-balance-aware cache arbitration with thread throttling to meet stringent bandwidth demands and minimize cache stalls during KV cache access. We also propose a hybrid simulation framework that integrates analytical models with cycle-level simulators via memory traces, balancing architectural detail against simulation efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is bottlenecked mainly by miss-handling throughput, whereas baselines mostly show negative improvements because they are not optimized for this scenario. When cache capacity is also limited, our policy achieves a 1.58x speedup over the unoptimized version and a 1.26x improvement over the best baseline (dyncta). Overall, LLaMCAT is the first work to target LLM decoding-specific MSHR contention, a gap left by previous work, and it presents a practical solution for accelerating LLM inference on future hardware platforms.
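To make the arbitration-plus-throttling idea concrete, here is a minimal illustrative sketch. It is not the paper's implementation: the class names (LLCPartition, MSHRAwareArbiter), the occupancy-based throttle threshold, and the lowest-occupancy-first ordering are all assumptions chosen only to show how MSHR availability and load balance could jointly drive an LLC arbitration and throttling decision.

```python
# Illustrative sketch (assumed design, not LLaMCAT's actual policy):
# an LLC arbiter that serves requests only when a partition has a free MSHR,
# favors less-loaded partitions, and throttles request issue when
# miss-handling resources are near saturation.

from collections import deque

class LLCPartition:
    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.mshrs_in_use = 0
        self.queue = deque()  # pending KV-cache requests for this partition

    def mshr_free(self):
        return self.mshrs_in_use < self.num_mshrs

    def occupancy(self):
        return self.mshrs_in_use / self.num_mshrs


class MSHRAwareArbiter:
    """Each cycle, picks which pending requests to service, steering work
    toward partitions with spare MSHRs (load balance) and signaling
    thread throttling when MSHRs are nearly exhausted."""

    def __init__(self, partitions, throttle_threshold=0.9):  # threshold is a made-up knob
        self.partitions = partitions
        self.throttle_threshold = throttle_threshold

    def should_throttle(self):
        # Stop threads from issuing new cache requests once average MSHR
        # occupancy across partitions exceeds the threshold.
        avg_occ = sum(p.occupancy() for p in self.partitions) / len(self.partitions)
        return avg_occ > self.throttle_threshold

    def arbitrate(self):
        # Serve at most one request per partition per cycle, visiting
        # less-occupied partitions first; skip partitions with no free MSHR.
        served = []
        for p in sorted(self.partitions, key=lambda x: x.occupancy()):
            if p.queue and p.mshr_free():
                req = p.queue.popleft()
                p.mshrs_in_use += 1  # MSHR released later when the miss returns
                served.append(req)
        return served
```

The intuition matches the abstract's claim: once MSHRs fill up, additional misses only stall the pipeline, so throttling issuing threads and steering requests toward less-loaded partitions keeps miss-handling throughput from collapsing.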
Similar Papers
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
Hardware Architecture
Makes AI remember more by using faster memory.
DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management
Hardware Architecture
Makes AI faster by sharing computer memory.