Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap
By: Shagnik Pal, Shaizeen Aga, Suchita Pati, and more
Potential Business Impact:
Speeds up distributed AI training and inference by overlapping the computation GPUs do with the communication between them.
As ML training and inference become increasingly distributed, parallelization techniques that shard (divide) the ML model across the GPUs of a distributed system are often deployed. With such techniques, data-dependent communication and computation operations are highly prevalent, and the communication is frequently exposed, leaving as much as 1.7x of ideal performance on the table. Prior works harness the fact that the ML model state and inputs are already sharded and carefully overlap individual computation and communication shards. While such coarse-grain overlap is promising, in this work we instead make a case for finer-grain compute-communication overlap, which we term FiCCO: overlap one level deeper than shard granularity, which unlocks compute/communication overlap for a wider set of network topologies, finer-grain dataflow, and more. We show that FiCCO opens up a wider design space of execution schedules than is possible at the shard level alone. At the same time, decomposing ML operations into smaller operations (done in both shard-based and finer-grain techniques) incurs operation-level efficiency losses. To balance the two, we first present a detailed characterization of these losses, then lay out a design space of FiCCO schedules, and finally overlay the schedules with their concomitant inefficiency signatures. Doing so lets us design heuristics that frameworks and runtimes can use to select bespoke FiCCO schedules based on the nature of the underlying ML operations. Finally, to further reduce the contention inefficiencies inherent in operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios drawn from realistic ML deployments and demonstrate that our bespoke schedules deliver up to 1.6x speedup and that our heuristics provide accurate guidance in 81% of unseen scenarios.
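To make the overlap idea concrete, below is a minimal, hypothetical PyTorch sketch (not the paper's FiCCO implementation) of finer-grain compute-communication overlap for a common data-dependent pattern: a tensor-parallel GEMM whose partial outputs must be all-reduced. Instead of running the full GEMM and then the full all-reduce back to back, the input is split into finer-grain chunks so that chunk i's all-reduce can proceed while chunk i+1's GEMM executes. The function name, chunk count, and the assumption of an initialized NCCL process group are illustrative; the sketch shows only the chunked scheduling idea, not the paper's DMA-engine offload.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation) of
# finer-grain overlap of a sharded GEMM with its dependent all-reduce.
# Baseline:   Y = all_reduce(X @ W_shard)  -- communication fully exposed.
# Overlapped: split X into chunks so chunk i's all-reduce (which the NCCL
# backend runs on its own stream) overlaps chunk i+1's GEMM.
import torch
import torch.distributed as dist

def overlapped_gemm_allreduce(x: torch.Tensor, w_shard: torch.Tensor,
                              num_chunks: int = 4) -> torch.Tensor:
    """x: [M, K] activations; w_shard: [K, N] weight shard on this GPU."""
    outputs, handles = [], []
    for x_chunk in torch.chunk(x, num_chunks, dim=0):
        y_chunk = torch.matmul(x_chunk, w_shard)      # finer-grain partial GEMM
        # async_op=True returns immediately; the collective can proceed
        # while the next chunk's GEMM occupies the compute units.
        handles.append(dist.all_reduce(y_chunk, async_op=True))
        outputs.append(y_chunk)
    for h in handles:                                 # drain outstanding communication
        h.wait()
    return torch.cat(outputs, dim=0)
```

Choosing num_chunks embodies the trade-off the abstract describes: more chunks hide more communication but make each GEMM smaller, incurring the operation-level efficiency losses that FiCCO's characterization and heuristics are designed to balance.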
Similar Papers
FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
Distributed, Parallel, and Cluster Computing
Speeds up distributed training by overlapping communication with computation using a lightweight design.
Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation
Distributed, Parallel, and Cluster Computing
Speeds up training across far-apart data centers by overlapping communication with computation and compensating for delays.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Speeds up distributed large language model inference by characterizing how GPUs communicate.