Understanding Stragglers in Large Model Training Using What-if Analysis
By: Jinkun Lin, Ziheng Jiang, Zuquan Song and more
Potential Business Impact:
Fixes slow computers to train AI faster.
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance, we find that stragglers are not always trivially caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
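To make the what-if idea concrete, here is a minimal toy sketch in Python. It is not the simulator used in the paper. It assumes a fully synchronous job in which every step ends at a barrier, so the actual step time is the slowest worker's duration, and it assumes the median worker duration is a reasonable stand-in for the straggler-free step time. The function name `what_if_slowdown` and the trace layout are hypothetical.

```python
from statistics import median


def what_if_slowdown(step_durations):
    """Estimate straggler-induced slowdown for a synchronous job.

    step_durations: list of steps, each a list of per-worker
    durations in seconds (hypothetical trace layout).
    """
    # Actual case: each step waits for its slowest worker.
    actual = sum(max(step) for step in step_durations)
    # What-if case (assumption): every worker runs at the median
    # speed for that step, so no worker holds the others back.
    ideal = sum(median(step) for step in step_durations)
    return actual / ideal


# Toy trace: 3 steps, 4 workers, one slow worker in step 2.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
slowdown = what_if_slowdown(trace)
print(f"estimated slowdown: {slowdown:.3f}x")
```

A ratio above 1 means the job would have finished faster without stragglers. In this toy trace a single slow worker in one step stretches the whole job from 3 to 4 seconds.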
Similar Papers
Straggler Tolerant and Resilient DL Training on Homogeneous GPUs
Distributed, Parallel, and Cluster Computing
Makes computer training faster by fixing slow parts.
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
Distributed, Parallel, and Cluster Computing
Lets computers solve hard scheduling puzzles from words.
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Distributed, Parallel, and Cluster Computing
Makes AI learn faster on many computers.