MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization
By: Rizhen Hu, Yutong He, Ran Yan, and more
Potential Business Impact:
Keeps AI training running when computers break.
As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources. To address this challenge, we propose Memory- and Computation-efficient Fault-tolerant Optimization (MeCeFO), a novel algorithm that ensures robust training with minimal overhead. When a computing node fails, MeCeFO seamlessly transfers its training task to a neighboring node while employing memory- and computation-efficient algorithmic optimizations to minimize the extra workload imposed on the neighboring node handling both tasks. MeCeFO leverages three key algorithmic designs: (i) Skip-connection, which drops the multi-head attention (MHA) module during backpropagation for a memory- and computation-efficient approximation; (ii) Recomputation, which reduces activation memory in feedforward networks (FFNs); and (iii) Low-rank gradient approximation, enabling efficient estimation of FFN weight matrix gradients. Theoretically, MeCeFO matches the $\mathcal{O}(1/\sqrt{nT})$ convergence rate of conventional distributed training, where $n$ is the data parallelism size and $T$ is the number of iterations. Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18% drop in throughput and demonstrating 5.0$\times$ to 6.7$\times$ greater resilience than previous SOTA approaches. Code is available at https://github.com/pkumelon/MeCeFO.
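To make the low-rank gradient idea concrete, below is a minimal PyTorch sketch of one way to approximate an FFN weight gradient with a random low-rank projection. It is an illustration only: the function lowrank_linear_grad, the Gaussian sketching matrix, and the chosen rank are assumptions made for this example, not the scheme used in the MeCeFO implementation.

```python
import torch

def lowrank_linear_grad(grad_out: torch.Tensor,
                        inp: torch.Tensor,
                        rank: int = 64) -> torch.Tensor:
    """Approximate dL/dW = grad_out.T @ inp for y = x @ W.T (nn.Linear convention).

    grad_out: (batch, d_out) gradient of the loss w.r.t. the layer output
    inp:      (batch, d_in)  saved input activations of the layer
    Returns a rank-`rank` approximation of the (d_out, d_in) weight gradient;
    the approximation error shrinks as `rank` grows.
    """
    d_in = inp.shape[-1]
    # Gaussian sketch scaled so that E[S @ S.T] = I, making the estimate unbiased.
    S = torch.randn(d_in, rank, device=inp.device, dtype=inp.dtype) / rank ** 0.5
    inp_sketch = inp @ S                 # (batch, rank): compressed activations
    core = grad_out.t() @ inp_sketch     # (d_out, rank): low-rank factor
    # The factored pair (core, S) is what one would keep to save memory;
    # here it is expanded back to full size to compare with the exact gradient.
    return core @ S.t()                  # (d_out, d_in)

# Quick sanity check against the exact gradient on random data.
torch.manual_seed(0)
x = torch.randn(256, 512)    # activations
g = torch.randn(256, 1024)   # upstream gradients
exact = g.t() @ x
approx = lowrank_linear_grad(g, x, rank=128)
rel_err = (approx - exact).norm() / exact.norm()
print(f"relative error of the rank-128 estimate: {rel_err:.2f}")
```

In spirit, keeping only the low-rank factors rather than the full gradient matrix is what could reduce the extra memory and compute on the node that absorbs a failed peer's workload, as the abstract describes; the exact factorization and update rule used by MeCeFO are given in the paper and repository.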
Similar Papers
MoFa: A Unified Performance Modeling Framework for LLM Pretraining
Distributed, Parallel, and Cluster Computing
Finds best way to train giant AI brains faster.
Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
Distributed, Parallel, and Cluster Computing
Makes smart computer brains learn faster on weak computers.