Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
By: Pol G. Recasens, Ferran Agullo, Yue Zhu, and more
Potential Business Impact:
Makes AI models run faster and use less power.
Large language models have been widely adopted across diverse tasks, but the auto-regressive nature of their generation often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capacity underutilized because DRAM bandwidth saturation is the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads; specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models. The code is publicly available at https://github.com/FerranAgulloLopez/vLLMBatchingMemoryGap.
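The repository name suggests the experiments are built on vLLM, so the throughput plateau described above can be probed directly from vLLM's offline Python API. The sketch below is not the authors' benchmark code; the model name, prompt set, and parameter values are illustrative assumptions. It sweeps the maximum running batch size (`max_num_seqs`) and reports output tokens per second, which is where the plateau for smaller models would show up.

```python
# Minimal sketch (assumed model, prompts, and parameter values) of a batch-size
# sweep with vLLM's offline API to observe the throughput plateau described in
# the abstract. Not the authors' benchmark code.
import time

from vllm import LLM, SamplingParams


def measure_throughput(max_batch: int) -> float:
    """Generate a fixed synthetic workload and return output tokens per second."""
    prompts = ["Summarize the history of GPU computing."] * 512  # synthetic workload
    sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

    llm = LLM(
        model="facebook/opt-1.3b",     # illustrative "smaller model"
        max_num_seqs=max_batch,        # upper bound on the running batch size
        gpu_memory_utilization=0.9,    # fraction of GPU memory vLLM may reserve
    )
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed


if __name__ == "__main__":
    # In practice, run each configuration in its own process so GPU memory is
    # fully released between runs; the loop here only shows the intended sweep.
    for max_batch in (32, 64, 128, 256):
        print(f"max_num_seqs={max_batch:4d}  tokens/s={measure_throughput(max_batch):.0f}")
```

If throughput stops scaling well before the GPU's compute limit, that is consistent with the paper's claim that large-batch decoding is bounded by DRAM bandwidth rather than by compute; lowering `gpu_memory_utilization` to leave room for a second replica is the kind of configuration change the proposed BCA reasons about.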
Similar Papers
Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching
Distributed, Parallel, and Cluster Computing
Makes AI models run faster and use less memory.
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
Machine Learning (CS)
Makes AI think much faster by using smart memory.
Hardware-based Heterogeneous Memory Management for Large Language Model Inference
Hardware Architecture
Makes AI models run faster on less memory.