FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
By: Ke Hong, Xiuhong Li, Minxu Liu, and more
Potential Business Impact:
Speeds up multi-GPU AI workloads by hiding inter-GPU communication behind computation.
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. Overlapping computation with communication, which exploits concurrent hardware execution, is an effective technique for hiding communication latency. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden across varying communication primitives. Nevertheless, current designs fail to optimize for all of these features simultaneously. To address the issue, we propose FlashOverlap, a lightweight design characterized by tile-wise overlapping, interference-free computation, and communication agnosticism. FlashOverlap uses a novel signaling mechanism to identify tile-wise data dependencies without interrupting the computation, and reorders data into contiguous addresses so that communication can be performed by simply calling NCCL APIs. Experiments show that this lightweight design achieves up to 1.65x speedup, outperforming existing works in most cases.
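The idea of signaling tile completion and packing finished tiles into contiguous addresses can be illustrated with a toy, CPU-only simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `simulate_tile_overlap` and `fake_all_reduce`, the group size, the shuffled completion order, and the assumption that every rank holds identical data are all hypothetical stand-ins. A real system would use GPU kernels, an atomic counter, and an NCCL collective in place of the plain Python loop.

```python
import random


def fake_all_reduce(buf, start, end, world_size=2):
    """Hypothetical stand-in for a collective such as an all-reduce (sum).

    Assumes every rank holds identical data, so the sum is value * world_size.
    """
    for i in range(start, end):
        buf[i] *= world_size


def simulate_tile_overlap(num_tiles, group_size, seed=0):
    """Toy model: tiles finish in arbitrary order, are packed contiguously,
    and each completed group is handed to the stand-in communication call."""
    rng = random.Random(seed)
    completion_order = list(range(num_tiles))
    rng.shuffle(completion_order)      # tiles do not finish in row-major order

    packed = [None] * num_tiles        # contiguous send buffer
    slot_of_tile = {}                  # tile id -> slot, used to undo the reorder
    sent_groups = []                   # (start, end) slices that were "communicated"
    done = 0                           # completion counter acting as the signal

    for tile_id in completion_order:
        tile_result = tile_id * 10     # stand-in for the tile's computed output
        packed[done] = tile_result     # write into the next contiguous slot
        slot_of_tile[tile_id] = done
        done += 1
        # Signal: once a whole group (or the final partial group) is ready,
        # its slice is contiguous and can be sent with one plain call.
        if done % group_size == 0 or done == num_tiles:
            start = (done - 1) // group_size * group_size
            sent_groups.append((start, done))
            fake_all_reduce(packed, start, done)

    # After communication, restore the original tile order.
    restored = [packed[slot_of_tile[t]] for t in range(num_tiles)]
    return restored, sent_groups


if __name__ == "__main__":
    restored, groups = simulate_tile_overlap(num_tiles=10, group_size=4)
    print(groups)
    print(restored)
```

Because each group occupies one contiguous slice, the communication step needs no knowledge of the tile layout, which is the sense in which such a design can stay agnostic to the communication primitive. In this sketch the computation loop never waits on the communication call's internals; on real hardware the two would run concurrently rather than interleaved.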