FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
By: Haijun Zhang, Jinxiang Wang, Zhenhua Yu and more
Potential Business Impact:
Fixes AI training crashes in seconds.
Large language models (LLMs) have made a profound impact across various fields due to their advanced capabilities. However, training these models at unprecedented scale requires extensive AI accelerator clusters and sophisticated parallelism strategies, which pose significant challenges to maintaining system reliability over prolonged training periods. A major concern is the substantial loss of training time caused by inevitable hardware and software failures. To address these challenges, we present FlashRecovery, a fast and low-cost failure recovery system comprising three core modules:

(1) Active and real-time failure detection. This module continuously monitors training state, identifying hardware and software failures within seconds and enabling rapid incident response.

(2) Scale-independent task restart. By applying different recovery strategies to normal and faulty nodes, combined with an optimized communication-group reconstruction protocol, our approach keeps recovery time nearly constant regardless of cluster scale.

(3) Checkpoint-free recovery within one step. Our novel recovery mechanism restores training in a single step, completely eliminating dependence on traditional checkpointing methods and their associated overhead.

Collectively, these innovations enable FlashRecovery to achieve near-optimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO), substantially improving the reliability and efficiency of long-duration LLM training. Experimental results show that FlashRecovery restores training on a cluster of 4,800 devices within 150 seconds. We also verify that failure recovery time remains nearly constant across training tasks of different scales.
Similar Papers
FFTrainer: Fast Failover in Large-Language Model Training with Almost-Free State Management
Distributed, Parallel, and Cluster Computing
Saves computer training from crashing and speeds it up.
Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
Distributed, Parallel, and Cluster Computing
Keeps AI working even when computers break.
ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training
Distributed, Parallel, and Cluster Computing
Keeps AI training running smoothly even when computers fail.