LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
By: Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, and more
Potential Business Impact:
Speeds up inference of giant AI models that must run across multiple GPU nodes.
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important: model-parallel strategies must now scale efficiently not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed LLM inference on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
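For intuition, the sketch below illustrates the recursive-doubling exchange pattern that the abstract says NVRAR builds on. It is a minimal single-process Python simulation of P ranks, not the paper's implementation: the real algorithm runs on GPUs with NVSHMEM-based communication and adds a hierarchical intra-node stage, neither of which is shown here. The function name, data layout, and power-of-two rank assumption are illustrative choices, not details from the paper.

```python
# Minimal sketch of recursive-doubling all-reduce, simulated in one process.
# Illustrates only the exchange pattern; NVRAR itself is a hierarchical,
# NVSHMEM-based GPU algorithm (not reproduced here).

def recursive_doubling_allreduce(per_rank_data):
    """per_rank_data: one equal-length vector per simulated rank.
    Returns the element-wise sum that every rank ends up holding.
    Assumes a power-of-two number of ranks."""
    p = len(per_rank_data)
    assert p & (p - 1) == 0, "power-of-two rank count assumed"
    buf = [list(v) for v in per_rank_data]  # each rank's working buffer

    step = 1
    while step < p:
        # Round k: rank r exchanges its buffer with partner r XOR 2^k and
        # both reduce, so after log2(p) rounds every rank holds the full sum.
        buf = [
            [a + b for a, b in zip(buf[rank], buf[rank ^ step])]
            for rank in range(p)
        ]
        step *= 2
    return buf

if __name__ == "__main__":
    # Four simulated ranks, each contributing a small vector.
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    result = recursive_doubling_allreduce(data)
    print(result[0])  # every rank ends with [16.0, 20.0]
```

Recursive doubling completes in log2(P) exchange rounds, which is why it is attractive for the latency-bound message sizes (128 KB to 2 MB) highlighted in the abstract, where the per-step latency of longer pipelines such as ring all-reduce tends to dominate over bandwidth.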
Similar Papers
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Measures how GPUs communicate during distributed LLM inference to expose where serving time goes.
Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks
Distributed, Parallel, and Cluster Computing
Runs LLM inference across devices over wireless networks while keeping communication costs low.
Optimizing Allreduce Operations for Heterogeneous Architectures with Multiple Processes per GPU
Distributed, Parallel, and Cluster Computing
Speeds up all-reduce on heterogeneous systems that run multiple processes per GPU.