Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap
By: Shagnik Pal, Shaizeen Aga, Suchita Pati, and more
Potential Business Impact:
Speeds up distributed AI training and inference by overlapping the computation GPUs do with the communication between them.
As ML training and inference become increasingly distributed, parallelization techniques that shard (divide) the ML model across the GPUs of a distributed system are often deployed. With such techniques, data-dependent communication and computation operations are highly prevalent, and the communication is frequently exposed, leaving as much as 1.7x of ideal performance on the table. Prior works harness the fact that the ML model state and inputs are already sharded and carefully overlap individual computation and communication shards. While such coarse-grain overlap is promising, in this work we instead make a case for finer-grain compute-communication overlap, which we term FiCCO: overlap one level deeper than shard granularity, which unlocks compute/communication overlap for a wider set of network topologies, finer-grain dataflow, and more. We show that FiCCO opens up a wider design space of execution schedules than is possible at the shard level alone. At the same time, decomposing ML operations into smaller operations (done in both shard-based and finer-grain techniques) incurs operation-level efficiency losses. To balance the two, we first present a detailed characterization of these losses, then lay out a design space of FiCCO schedules, and finally overlay the schedules with their concomitant inefficiency signatures. Doing so lets us design heuristics that frameworks and runtimes can use to select bespoke FiCCO schedules based on the nature of the underlying ML operations. Finally, to further reduce the contention inefficiencies inherent in operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios drawn from realistic ML deployments and demonstrate that our bespoke schedules deliver up to 1.6x speedup and that our heuristics provide accurate guidance in 81% of unseen scenarios.
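To make the overlap idea concrete, below is a minimal, hypothetical PyTorch sketch (not the paper's FiCCO implementation) of finer-grain compute-communication overlap for a common data-dependent pattern: a tensor-parallel GEMM whose partial outputs must be all-reduced. Instead of running the full GEMM and then the full all-reduce back to back, the input is split into finer-grain chunks so that chunk i's all-reduce can proceed while chunk i+1's GEMM executes. The function name, chunk count, and the assumption of an initialized NCCL process group are illustrative; the sketch shows only the chunked scheduling idea, not the paper's DMA-engine offload.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation) of
# finer-grain overlap of a sharded GEMM with its dependent all-reduce.
# Baseline:   Y = all_reduce(X @ W_shard)  -- communication fully exposed.
# Overlapped: split X into chunks so chunk i's all-reduce (which the NCCL
# backend runs on its own stream) overlaps chunk i+1's GEMM.
import torch
import torch.distributed as dist

def overlapped_gemm_allreduce(x: torch.Tensor, w_shard: torch.Tensor,
                              num_chunks: int = 4) -> torch.Tensor:
    """x: [M, K] activations; w_shard: [K, N] weight shard on this GPU."""
    outputs, handles = [], []
    for x_chunk in torch.chunk(x, num_chunks, dim=0):
        y_chunk = torch.matmul(x_chunk, w_shard)      # finer-grain partial GEMM
        # async_op=True returns immediately; the collective can proceed
        # while the next chunk's GEMM occupies the compute units.
        handles.append(dist.all_reduce(y_chunk, async_op=True))
        outputs.append(y_chunk)
    for h in handles:                                 # drain outstanding communication
        h.wait()
    return torch.cat(outputs, dim=0)
```

Choosing num_chunks embodies the trade-off the abstract describes: more chunks hide more communication but make each GEMM smaller, incurring the operation-level efficiency losses that FiCCO's characterization and heuristics are designed to balance.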
Similar Papers
FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
Distributed, Parallel, and Cluster Computing
Speeds up distributed training by overlapping communication with computation using a lightweight design.
Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation
Distributed, Parallel, and Cluster Computing
Speeds up training across far-apart data centers by overlapping communication with computation and compensating for delays.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Speeds up distributed large language model inference by characterizing how GPUs communicate.