SIGMA: An AI-Empowered Training Stack on Early-Life Hardware
By: Lei Qu, Lianhai Ren, Peng Cheng, and more
Potential Business Impact:
Makes AI training faster and more reliable.
An increasing variety of AI accelerators is being considered for large-scale training. However, large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization, compounded by unpredictable local noise, that degrades efficiency. To address these challenges, SIGMA is an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. At its core is the LUCIA TRAINING PLATFORM (LTP), a training platform optimized for clusters built on early-life AI accelerators. Since its launch in March 2025, LTP has significantly improved training reliability and operational productivity: over the past five months it has sustained 94.45% effective cluster accelerator utilization while substantially reducing node-recycling and job-recovery times. Building on LTP, the LUCIA TRAINING FRAMEWORK (LTF) trained SIGMA-MOE, a 200B-parameter MoE model, on 2,048 AI accelerators, achieving 21.08% MFU and state-of-the-art downstream accuracy with only one stability incident over a 75-day period. Together, these advances make SIGMA not only an answer to the critical challenges of large-scale training but also a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to established accelerator stacks and advancing AI capability and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.
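For context on the headline efficiency figure, the sketch below shows how MFU (Model FLOPs Utilization) is commonly estimated. It is not taken from the SIGMA code base; it assumes the standard ~6 x parameters x tokens approximation for training FLOPs, and the parameter count, throughput, and peak-FLOPs values are illustrative placeholders rather than SIGMA-MoE's actual configuration.

    # Illustrative sketch (not from the paper): estimating MFU for a training run.
    # Assumes the common ~6 * params * tokens FLOPs approximation for transformer
    # training; all numeric inputs below are placeholders, not SIGMA-MoE's setup.

    def model_flops_utilization(activated_params: float,
                                tokens_per_second: float,
                                num_accelerators: int,
                                peak_flops_per_accelerator: float) -> float:
        """Return MFU = achieved model FLOPs/s divided by aggregate peak FLOPs/s."""
        achieved_flops_per_second = 6.0 * activated_params * tokens_per_second
        peak_flops_per_second = num_accelerators * peak_flops_per_accelerator
        return achieved_flops_per_second / peak_flops_per_second

    # Example with made-up numbers: 20B activated parameters, 1.5M tokens/s,
    # 2,048 accelerators at 300 TFLOPS peak each.
    mfu = model_flops_utilization(20e9, 1.5e6, 2048, 300e12)
    print(f"MFU: {mfu:.2%}")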
Similar Papers
STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
Distributed, Parallel, and Cluster Computing
Generates simulated AI workloads for testing distributed systems.
Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs
Distributed, Parallel, and Cluster Computing
Builds synthetic AI workloads for testing distributed systems.
AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
Hardware Architecture
Evaluates which computer chips work best for AI inference.