Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
By: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, and more
Potential Business Impact:
Predicts how long training an AI model will take without needing a supercomputer.
Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion-parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors of 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200) for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
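The abstract describes the recipe only at a high level: estimate each operator's cost from a hardware-aware model, then compose those estimates with the communication costs implied by the parallelization strategy. The minimal Python sketch below illustrates that composition under simple assumptions; the roofline-style operator model, the ring all-reduce formula, and every constant are illustrative placeholders, not the authors' calibrated framework or measured data.

```python
# Illustrative sketch only: hypothetical operator costs and parallelism terms,
# not the paper's actual prediction models or calibration procedure.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    flops: float        # floating-point operations for one forward+backward pass
    bytes_moved: float  # memory traffic in bytes

def op_time(op: Op, peak_flops: float, mem_bw: float, efficiency: float = 0.6) -> float:
    """Roofline-style estimate: an operator is bound by compute or memory bandwidth."""
    compute_t = op.flops / (peak_flops * efficiency)
    memory_t = op.bytes_moved / mem_bw
    return max(compute_t, memory_t)

def allreduce_time(grad_bytes: float, n_gpus: int, link_bw: float) -> float:
    """Ring all-reduce cost model: roughly 2*(n-1)/n of the payload crosses each link."""
    if n_gpus <= 1:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / link_bw

def iteration_time(ops, grad_bytes, n_dp, peak_flops, mem_bw, link_bw):
    """End-to-end per-iteration estimate: sum of operator times plus
    data-parallel gradient synchronization (compute/comm overlap ignored)."""
    compute = sum(op_time(op, peak_flops, mem_bw) for op in ops)
    comm = allreduce_time(grad_bytes, n_dp, link_bw)
    return compute + comm

# Toy example: two transformer sub-operators on an A100-class GPU (numbers illustrative).
ops = [Op("attention", flops=4e12, bytes_moved=2e9),
       Op("mlp", flops=8e12, bytes_moved=3e9)]
t = iteration_time(ops, grad_bytes=4e9, n_dp=8,
                   peak_flops=312e12, mem_bw=2.0e12, link_bw=100e9)
print(f"estimated iteration time: {t * 1e3:.1f} ms")
```

In the paper's approach the per-operator terms are fitted from lightweight sampled measurements rather than a fixed efficiency factor, which is what lets the whole prediction run on CPUs without on-cluster experiments.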
Similar Papers
Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework
Machine Learning (CS)
Trains AI models faster and uses less energy.
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
Distributed, Parallel, and Cluster Computing
Lets computers solve hard scheduling problems described in plain language.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI talk faster by fixing how computers share info.