Optimizing Allreduce Operations for Heterogeneous Architectures with Multiple Processes per GPU
By: Michael Adams, Amanda Bienz
Potential Business Impact:
Speeds up AI training by putting otherwise idle CPU cores to work alongside the GPUs during communication.
Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures consist of complex nodes, often containing $4$ GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This paper presents novel optimizations to large GPU-aware all-reduce operations, extending lane-aware reductions to the GPUs and, notably, using multiple CPU cores per GPU to accelerate these operations. These multi-CPU-accelerated GPU-aware lane all-reduces yield speedups of up to $2.45$x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer. Finally, the approach is extended to NVIDIA's and AMD's collective communication libraries, achieving speedups of up to $1.77$x and $1.71$x, respectively, across $2$ state-of-the-art supercomputers.
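For context, the sketch below illustrates the basic lane-aware (multilane) all-reduce pattern that the paper builds on, using only standard MPI calls on host buffers: a reduce-scatter within each node, an all-reduce of each chunk across nodes within its "lane" (ranks sharing the same local rank), and a node-local allgather. This is a minimal illustration under simplifying assumptions (message length divisible by the ranks per node); it is not the authors' GPU-aware, multi-CPU-core-per-GPU implementation, and the function name lane_allreduce is hypothetical.

/* Minimal host-side sketch of a lane-aware (multilane) all-reduce.
 * Assumptions: count is divisible by the number of ranks per node,
 * and every node runs the same number of ranks. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void lane_allreduce(const double *sendbuf, double *recvbuf, int count,
                    MPI_Comm world)
{
    MPI_Comm node_comm, lane_comm;
    /* Node-local communicator: all ranks sharing this node. */
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);

    int world_rank, local_rank, ppn;
    MPI_Comm_rank(world, &world_rank);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &ppn);

    /* Lane communicator: ranks with the same local rank, one per node. */
    MPI_Comm_split(world, local_rank, world_rank, &lane_comm);

    int chunk = count / ppn;               /* assumes count % ppn == 0 */
    double *tmp = malloc(chunk * sizeof(double));

    /* 1. Reduce-scatter within the node: each local rank owns one chunk. */
    MPI_Reduce_scatter_block(sendbuf, tmp, chunk, MPI_DOUBLE, MPI_SUM,
                             node_comm);

    /* 2. All-reduce each chunk across nodes, within its lane. */
    MPI_Allreduce(MPI_IN_PLACE, tmp, chunk, MPI_DOUBLE, MPI_SUM, lane_comm);

    /* 3. Allgather within the node to reassemble the full result. */
    MPI_Allgather(tmp, chunk, MPI_DOUBLE, recvbuf, chunk, MPI_DOUBLE,
                  node_comm);

    free(tmp);
    MPI_Comm_free(&node_comm);
    MPI_Comm_free(&lane_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;             /* illustrative message size */
    double *send = malloc(count * sizeof(double));
    double *recv = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) send[i] = (double)rank;

    lane_allreduce(send, recv, count, MPI_COMM_WORLD);

    if (rank == 0)
        printf("recv[0] = %f\n", recv[0]); /* equals the sum of all ranks */

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}

In the paper's setting, the reduction data resides on GPUs and the node-local steps are shared among several CPU cores per GPU; the sketch above only conveys the lane decomposition itself.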
Similar Papers
LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
Distributed, Parallel, and Cluster Computing
Speeds up multi-node LLM inference with faster all-reduce communication.
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Distributed, Parallel, and Cluster Computing
Simplifies writing fast multi-GPU AI kernels.
Managing Multi Instance GPUs for High Throughput and Energy Savings
Distributed, Parallel, and Cluster Computing
Improves GPU throughput and energy efficiency by managing Multi-Instance GPUs.