Simulating LLM training workloads for heterogeneous compute and network infrastructure
By: Sumit Kumar, Arjun Temura, Naman Sharma, and more
Potential Business Impact:
Makes AI training faster on clusters built from mixed generations of computer hardware.
The growing demand for large-scale GPU clusters in distributed model training presents a significant barrier to innovation, particularly in model optimization, performance tuning, and system-level enhancements. To address this challenge, LLM training simulators are employed to estimate training time and guide design decisions. However, state-of-the-art LLM training simulators assume homogeneous compute and network infrastructure. In practice, device heterogeneity is inevitable due to resource sharing in cloud environments, frequent shifts in device generations, and inherent intra-chip interconnect heterogeneity. To bridge the gap between the state of the art and practical requirements, we propose the design of a heterogeneity-aware distributed LLM simulator capable of predicting training time while providing abstractions to specify custom configurations for device groups and device-to-parallelism mappings. We present the design requirements and challenges in building a heterogeneity-aware distributed ML training simulator, along with design components such as non-uniform workload partitioning. Our initial simulation results demonstrate the impact of heterogeneity on model computation and communication time.
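To make the idea of device groups and non-uniform workload partitioning concrete, the sketch below shows one plausible way such abstractions could look. It is not the paper's actual interface: the names (DeviceGroup, partition_layers), the throughput and bandwidth fields, and the proportional-split heuristic are all illustrative assumptions. The sketch simply splits pipeline layers across device groups in proportion to their aggregate compute throughput, so faster groups receive more layers.

```python
from dataclasses import dataclass

# Hypothetical device-group abstraction (illustrative, not the paper's API):
# each group bundles devices of one generation with rough performance numbers.
@dataclass
class DeviceGroup:
    name: str
    num_devices: int
    tflops: float       # assumed effective per-device compute throughput
    link_gbps: float    # assumed intra-group interconnect bandwidth

def partition_layers(num_layers: int, groups: list[DeviceGroup]) -> dict[str, int]:
    """Non-uniform workload partitioning sketch: assign pipeline layers to
    device groups in proportion to aggregate throughput, so that per-stage
    compute times stay roughly balanced across heterogeneous hardware."""
    weights = [g.num_devices * g.tflops for g in groups]
    total = sum(weights)
    raw = [num_layers * w / total for w in weights]
    assigned = [int(r) for r in raw]
    # Largest-remainder rounding so the assigned counts sum to num_layers.
    by_remainder = sorted(range(len(groups)),
                          key=lambda i: raw[i] - assigned[i], reverse=True)
    for i in by_remainder[: num_layers - sum(assigned)]:
        assigned[i] += 1
    return {g.name: n for g, n in zip(groups, assigned)}

if __name__ == "__main__":
    # Example mixed cluster of two GPU generations (numbers are illustrative).
    groups = [
        DeviceGroup("A100_pool", num_devices=8, tflops=312, link_gbps=600),
        DeviceGroup("V100_pool", num_devices=8, tflops=125, link_gbps=300),
    ]
    print(partition_layers(num_layers=48, groups=groups))
    # -> {'A100_pool': 34, 'V100_pool': 14} under these assumed throughputs
```

A real heterogeneity-aware simulator would likely refine this with measured kernel times and communication costs rather than peak TFLOPS, but the proportional split illustrates why uniform partitioning underutilizes faster device groups in a mixed cluster.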
Similar Papers
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Distributed, Parallel, and Cluster Computing
Tests how to make AI answer questions faster.
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Distributed, Parallel, and Cluster Computing
Predicts computer learning time without needing supercomputers.
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Distributed, Parallel, and Cluster Computing
Makes AI models train faster on many computers.