Simulating LLM training workloads for heterogeneous compute and network infrastructure
By: Sumit Kumar, Arjun Temura, Naman Sharma, and more
Potential Business Impact:
Makes AI training faster on clusters built from mixed generations of computer hardware.
The growing demand for large-scale GPU clusters in distributed model training presents a significant barrier to innovation, particularly in model optimization, performance tuning, and system-level enhancements. To address this challenge, LLM training simulators are employed to estimate training time and guide design decisions. However, state-of-the-art LLM training simulators assume homogeneous compute and network infrastructure. In practice, device heterogeneity is inevitable due to resource sharing in cloud environments, frequent shifts in device generations, and inherent intra-chip interconnect heterogeneity. To bridge the gap between the state of the art and practical requirements, we propose the design of a heterogeneity-aware distributed LLM simulator capable of predicting training time while providing abstractions to specify custom configurations for device groups and device-to-parallelism mappings. We present the design requirements and challenges in building a heterogeneity-aware distributed ML training simulator, along with design components such as non-uniform workload partitioning. Our initial simulation results demonstrate the impact of heterogeneity on model computation and communication time.
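To make the idea of device groups and non-uniform workload partitioning concrete, the sketch below shows one plausible way such abstractions could look. It is not the paper's actual interface: the names (DeviceGroup, partition_layers), the throughput and bandwidth fields, and the proportional-split heuristic are all illustrative assumptions. The sketch simply splits pipeline layers across device groups in proportion to their aggregate compute throughput, so faster groups receive more layers.

```python
from dataclasses import dataclass

# Hypothetical device-group abstraction (illustrative, not the paper's API):
# each group bundles devices of one generation with rough performance numbers.
@dataclass
class DeviceGroup:
    name: str
    num_devices: int
    tflops: float       # assumed effective per-device compute throughput
    link_gbps: float    # assumed intra-group interconnect bandwidth

def partition_layers(num_layers: int, groups: list[DeviceGroup]) -> dict[str, int]:
    """Non-uniform workload partitioning sketch: assign pipeline layers to
    device groups in proportion to aggregate throughput, so that per-stage
    compute times stay roughly balanced across heterogeneous hardware."""
    weights = [g.num_devices * g.tflops for g in groups]
    total = sum(weights)
    raw = [num_layers * w / total for w in weights]
    assigned = [int(r) for r in raw]
    # Largest-remainder rounding so the assigned counts sum to num_layers.
    by_remainder = sorted(range(len(groups)),
                          key=lambda i: raw[i] - assigned[i], reverse=True)
    for i in by_remainder[: num_layers - sum(assigned)]:
        assigned[i] += 1
    return {g.name: n for g, n in zip(groups, assigned)}

if __name__ == "__main__":
    # Example mixed cluster of two GPU generations (numbers are illustrative).
    groups = [
        DeviceGroup("A100_pool", num_devices=8, tflops=312, link_gbps=600),
        DeviceGroup("V100_pool", num_devices=8, tflops=125, link_gbps=300),
    ]
    print(partition_layers(num_layers=48, groups=groups))
    # -> {'A100_pool': 34, 'V100_pool': 14} under these assumed throughputs
```

A real heterogeneity-aware simulator would likely refine this with measured kernel times and communication costs rather than peak TFLOPS, but the proportional split illustrates why uniform partitioning underutilizes faster device groups in a mixed cluster.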
Similar Papers
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Distributed, Parallel, and Cluster Computing
Tests how to make AI answer questions faster.
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Distributed, Parallel, and Cluster Computing
Predicts computer learning time without needing supercomputers.
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Distributed, Parallel, and Cluster Computing
Makes AI models train faster on many computers.