FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
By: Haijun Zhang, Jinxiang Wang, Zhenhua Yu and more
Potential Business Impact:
Fixes AI training crashes in seconds.
Large language models (LLMs) have made a profound impact across various fields due to their advanced capabilities. However, training these models at unprecedented scale requires extensive AI accelerator clusters and sophisticated parallelism strategies, which pose significant challenges to maintaining system reliability over prolonged training periods. A major concern is the substantial loss of training time caused by inevitable hardware and software failures. To address these challenges, we present FlashRecovery, a fast and low-cost failure recovery system comprising three core modules:

(1) Active and real-time failure detection. This module continuously monitors training state, identifying hardware and software failures within seconds and enabling rapid incident response.

(2) Scale-independent task restart. By applying different recovery strategies to normal and faulty nodes, combined with an optimized communication-group reconstruction protocol, our approach keeps recovery time nearly constant regardless of cluster scale.

(3) Checkpoint-free recovery within one step. Our novel recovery mechanism restores training in a single step, completely eliminating dependence on traditional checkpointing methods and their associated overhead.

Collectively, these innovations enable FlashRecovery to achieve near-optimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO), substantially improving the reliability and efficiency of long-duration LLM training. Experimental results show that FlashRecovery restores training on a cluster of 4,800 devices within 150 seconds. We also verify that failure recovery time remains nearly constant across training tasks of different scales.
Similar Papers
FFTrainer: Fast Failover in Large-Language Model Training with Almost-Free State Management
Distributed, Parallel, and Cluster Computing
Saves computer training from crashing and speeds it up.
Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
Distributed, Parallel, and Cluster Computing
Keeps AI working even when computers break.
ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training
Distributed, Parallel, and Cluster Computing
Keeps AI training running smoothly even when computers fail.