ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training
By: Xueze Kang, Guangyu Xiang, Yuxin Wang, and more
Potential Business Impact:
Keeps AI training running smoothly when computers break.
Large-scale LLM pretraining now runs across $10^5$--$10^6$ accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consistency. No prior system achieves all four simultaneously. To achieve these goals, we present ElasWave, which delivers per-step fault tolerance via multi-dimensional scheduling across the graph, dataflow, DVFS, and RNG. ElasWave reshapes and reshards micro-batches while preserving the global batch size and gradient scale. It performs online pipeline resharding with asynchronous parameter migration and interleaves ZeRO partitions, reducing parameter recovery to disjoint rank-to-rank transfers. It further leverages DVFS to absorb pipeline bubbles and reshards RNG state to preserve computation consistency. A dynamic communicator enables in-place edits to communication groups, while per-step in-memory snapshots support online verification and redistribution. We evaluate ElasWave on 96 NPUs and benchmark it against state-of-the-art baselines: throughput improves by $1.35\times$ over ReCycle and $1.60\times$ over TorchFT; communicator recovery completes within one second (up to $82\times$/$3.6\times$ faster than full/partial rebuilds); migration MTTR drops by as much as $51\%$; and convergence deviation is reduced by approximately $78\%$.
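To make the micro-batch resharding idea concrete, the sketch below shows the basic bookkeeping: after the data-parallel world shrinks, micro-batches are redistributed across surviving ranks so the global batch size stays fixed and each rank's gradient contribution is reweighted accordingly. This is a minimal Python illustration under assumed names and an even-split policy, not ElasWave's actual implementation.

```python
# Hypothetical sketch (not ElasWave's API): redistribute micro-batches after a
# rank failure so the global batch size and gradient scale are unchanged.

def reshard_micro_batches(global_batch: int, micro_batch: int, new_world: int):
    """Return per-rank micro-batch counts for the surviving ranks.

    Assumes `global_batch` is divisible by `micro_batch`; the remainder of the
    even split is spread one extra micro-batch per rank.
    """
    assert global_batch % micro_batch == 0, "global batch must be a multiple of the micro-batch"
    total_micro = global_batch // micro_batch           # micro-batches per training step
    base, extra = divmod(total_micro, new_world)        # even share plus remainder
    counts = [base + (1 if r < extra else 0) for r in range(new_world)]
    assert sum(counts) * micro_batch == global_batch    # global batch size preserved
    return counts

def gradient_weight(counts, rank: int) -> float:
    """Per-rank weight so the averaged gradient keeps the pre-failure scale."""
    return counts[rank] / sum(counts)

# Example: 512-sample global batch, micro-batch 4, world size shrinks 16 -> 13.
counts = reshard_micro_batches(global_batch=512, micro_batch=4, new_world=13)
print(counts, sum(counts))          # ranks get 10 or 9 micro-batches, summing to 128
print(gradient_weight(counts, 0))   # first rank contributes 10/128 of the average
```

The even-split-plus-remainder policy is only one choice; the paper's scheduler additionally coordinates this with pipeline resharding, DVFS, and RNG state, which the sketch does not model.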
Similar Papers
ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training
Distributed, Parallel, and Cluster Computing
Keeps big AI models training when computers break.
AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism
Distributed, Parallel, and Cluster Computing
Keeps AI running smoothly even if a part breaks.