LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
By: Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, and more
Potential Business Impact:
Speeds up inference of giant AI models that must run across multiple GPU nodes.
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important: model-parallel strategies must now scale efficiently not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed LLM inference on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
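For intuition, the sketch below illustrates the recursive-doubling exchange pattern that the abstract says NVRAR builds on. It is a minimal single-process Python simulation of P ranks, not the paper's implementation: the real algorithm runs on GPUs with NVSHMEM-based communication and adds a hierarchical intra-node stage, neither of which is shown here. The function name, data layout, and power-of-two rank assumption are illustrative choices, not details from the paper.

```python
# Minimal sketch of recursive-doubling all-reduce, simulated in one process.
# Illustrates only the exchange pattern; NVRAR itself is a hierarchical,
# NVSHMEM-based GPU algorithm (not reproduced here).

def recursive_doubling_allreduce(per_rank_data):
    """per_rank_data: one equal-length vector per simulated rank.
    Returns the element-wise sum that every rank ends up holding.
    Assumes a power-of-two number of ranks."""
    p = len(per_rank_data)
    assert p & (p - 1) == 0, "power-of-two rank count assumed"
    buf = [list(v) for v in per_rank_data]  # each rank's working buffer

    step = 1
    while step < p:
        # Round k: rank r exchanges its buffer with partner r XOR 2^k and
        # both reduce, so after log2(p) rounds every rank holds the full sum.
        buf = [
            [a + b for a, b in zip(buf[rank], buf[rank ^ step])]
            for rank in range(p)
        ]
        step *= 2
    return buf

if __name__ == "__main__":
    # Four simulated ranks, each contributing a small vector.
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    result = recursive_doubling_allreduce(data)
    print(result[0])  # every rank ends with [16.0, 20.0]
```

Recursive doubling completes in log2(P) exchange rounds, which is why it is attractive for the latency-bound message sizes (128 KB to 2 MB) highlighted in the abstract, where the per-step latency of longer pipelines such as ring all-reduce tends to dominate over bandwidth.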
Similar Papers
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Measures how GPUs communicate during distributed LLM inference to expose where serving time goes.
Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks
Distributed, Parallel, and Cluster Computing
Runs LLM inference across devices over wireless networks while keeping communication costs low.
Optimizing Allreduce Operations for Heterogeneous Architectures with Multiple Processes per GPU
Distributed, Parallel, and Cluster Computing
Speeds up all-reduce on heterogeneous systems that run multiple processes per GPU.