Robust LLM Training Infrastructure at ByteDance
By: Borui Wan, Gaohong Liu, Zuquan Song, and more
Potential Business Impact:
Keeps giant computer brains training without stopping.
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and continues to expand, enabling larger models to be trained faster. Accompanying this expansion in resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable LLM training. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelism structure and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.
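To make the headline metric concrete, here is a minimal, self-contained Python sketch (not ByteRobust's actual code) of the recovery pattern the abstract describes: a supervisor detects a failed step, rolls back to the last checkpoint, resumes, and the run's ETTR is the productive training time divided by total wall-clock time. All constants (checkpoint interval, failure probability, per-step cost, recovery overhead) are hypothetical placeholders.

```python
"""Minimal sketch of checkpoint-based failure recovery and the ETTR
metric. Not ByteRobust's implementation; all constants are illustrative."""
import random

CKPT_INTERVAL = 100     # hypothetical: steps between checkpoints
TOTAL_STEPS = 1_000     # hypothetical job length
FAILURE_PROB = 0.002    # hypothetical per-step failure probability
STEP_COST = 0.1         # hypothetical seconds of useful work per step
RECOVERY_COST = 30.0    # hypothetical seconds to diagnose and restart

def train_step(step: int) -> None:
    """Simulated step; raises the way a CUDA error, NaN loss, or hang
    would surface as a failure in a production job."""
    if random.random() < FAILURE_PROB:
        raise RuntimeError(f"simulated failure at step {step}")

def run_with_recovery() -> float:
    """Supervisor loop: on failure, roll back to the last checkpoint and
    resume. Returns the ETTR achieved by the simulated run."""
    productive = 0.0    # time spent on steps that were kept
    wasted = 0.0        # redone work plus recovery overhead
    last_ckpt = 0
    step = 0
    while step < TOTAL_STEPS:
        try:
            train_step(step)
            productive += STEP_COST
            step += 1
            if step % CKPT_INTERVAL == 0:
                last_ckpt = step            # persist state (elided)
        except RuntimeError:
            lost = (step - last_ckpt) * STEP_COST
            productive -= lost              # that work must be redone
            wasted += lost + RECOVERY_COST
            step = last_ckpt                # restart from the checkpoint
    return productive / (productive + wasted)

if __name__ == "__main__":
    random.seed(0)
    print(f"simulated ETTR: {run_with_recovery():.3f}")
```

By the same arithmetic, the reported 97% ETTR over a three-month run corresponds to roughly 2.7 days lost to failures and recovery, which is why frequent checkpointing and fast fault localization dominate the design.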
Similar Papers
Role-Based Fault Tolerance System for LLM RL Post-Training
Distributed, Parallel, and Cluster Computing
Keeps AI learning even when computers break.
BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
Machine Learning (CS)
Makes AI models start training 50% faster.
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Distributed, Parallel, and Cluster Computing
Predicts computer learning time without needing supercomputers.