Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
By: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, and more
Potential Business Impact:
Predicts how long training an AI model will take without needing a supercomputer.
Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion-parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors of 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200) for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
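The abstract describes the recipe only at a high level: estimate each operator's cost from a hardware-aware model, then compose those estimates with the communication costs implied by the parallelization strategy. The minimal Python sketch below illustrates that composition under simple assumptions; the roofline-style operator model, the ring all-reduce formula, and every constant are illustrative placeholders, not the authors' calibrated framework or measured data.

```python
# Illustrative sketch only: hypothetical operator costs and parallelism terms,
# not the paper's actual prediction models or calibration procedure.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    flops: float        # floating-point operations for one forward+backward pass
    bytes_moved: float  # memory traffic in bytes

def op_time(op: Op, peak_flops: float, mem_bw: float, efficiency: float = 0.6) -> float:
    """Roofline-style estimate: an operator is bound by compute or memory bandwidth."""
    compute_t = op.flops / (peak_flops * efficiency)
    memory_t = op.bytes_moved / mem_bw
    return max(compute_t, memory_t)

def allreduce_time(grad_bytes: float, n_gpus: int, link_bw: float) -> float:
    """Ring all-reduce cost model: roughly 2*(n-1)/n of the payload crosses each link."""
    if n_gpus <= 1:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / link_bw

def iteration_time(ops, grad_bytes, n_dp, peak_flops, mem_bw, link_bw):
    """End-to-end per-iteration estimate: sum of operator times plus
    data-parallel gradient synchronization (compute/comm overlap ignored)."""
    compute = sum(op_time(op, peak_flops, mem_bw) for op in ops)
    comm = allreduce_time(grad_bytes, n_dp, link_bw)
    return compute + comm

# Toy example: two transformer sub-operators on an A100-class GPU (numbers illustrative).
ops = [Op("attention", flops=4e12, bytes_moved=2e9),
       Op("mlp", flops=8e12, bytes_moved=3e9)]
t = iteration_time(ops, grad_bytes=4e9, n_dp=8,
                   peak_flops=312e12, mem_bw=2.0e12, link_bw=100e9)
print(f"estimated iteration time: {t * 1e3:.1f} ms")
```

In the paper's approach the per-operator terms are fitted from lightweight sampled measurements rather than a fixed efficiency factor, which is what lets the whole prediction run on CPUs without on-cluster experiments.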
Similar Papers
Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework
Machine Learning (CS)
Trains AI models faster and uses less energy.
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
Distributed, Parallel, and Cluster Computing
Lets computers solve hard scheduling problems described in plain language.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI talk faster by fixing how computers share info.