System-performance and cost modeling of Large Language Model training and inference
By: Wenzhe Guo, Joyjit Kundu, Uras Tos, and more
Potential Business Impact:
Makes big AI models cheaper to train and run.
Large language models (LLMs), based on transformer architectures, have revolutionized numerous domains within artificial intelligence, science, and engineering due to their exceptional scalability and adaptability. However, the exponential growth in LLM size and complexity has outpaced advancements in compute capacity, memory bandwidth, network performance, and cost efficiency, posing significant challenges to their scalability on distributed systems. To address these limitations, alternative model architectures, optimization strategies, communication-aware network topologies, and novel system design approaches have been proposed in the literature. This paper introduces a performance-cost modeling methodology for LLM training and inference that integrates state-of-the-art compute techniques with memory optimizations and the latest communication techniques. Building on an analytical performance model, our approach incorporates recent innovations such as flash attention and mixture-of-experts models to address memory bandwidth and compute bottlenecks. It also considers the impact of different network topologies and topology-specific communication algorithms under 5D parallelism, and integrates a chiplet cost model. The proposed methodology provides valuable insights to guide future compute system design and facilitates hardware-software co-development, in particular through its ability to analyze performance-cost trade-offs across system architectural configurations.
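To make the flavor of such an analytical performance model concrete, here is a minimal roofline-style sketch in Python. It is not the paper's model: the hardware numbers, helper names (`System`, `layer_time`, `dense_layer_estimate`), the non-overlapped communication term, and the GEMM FLOP approximation are all illustrative assumptions.

```python
# Minimal sketch of a roofline-style analytical performance model for one
# transformer layer. All hardware figures and formulas below are
# illustrative assumptions, not values or equations from the paper.

from dataclasses import dataclass


@dataclass
class System:
    peak_flops: float  # peak compute throughput (FLOP/s)
    mem_bw: float      # device memory bandwidth (bytes/s)
    net_bw: float      # inter-device link bandwidth (bytes/s)


def layer_time(flops: float, bytes_moved: float, comm_bytes: float,
               sys: System) -> float:
    """Roofline estimate: a kernel is bound by compute or memory,
    whichever is slower; communication is modeled here as a separate,
    non-overlapped term for simplicity."""
    compute_t = flops / sys.peak_flops
    memory_t = bytes_moved / sys.mem_bw
    comm_t = comm_bytes / sys.net_bw
    return max(compute_t, memory_t) + comm_t


def dense_layer_estimate(b: int, s: int, h: int, tp: int,
                         sys: System, dtype_bytes: int = 2) -> float:
    """One dense transformer layer with batch b, sequence length s,
    hidden size h, sharded over tensor-parallel degree tp."""
    flops = 24 * b * s * h * h / tp                   # rough GEMM FLOPs per shard
    bytes_moved = 12 * h * h * dtype_bytes / tp       # weight traffic only
    comm_bytes = 2 * (tp - 1) / tp * b * s * h * dtype_bytes  # ring all-reduce volume
    return layer_time(flops, bytes_moved, comm_bytes, sys)


if __name__ == "__main__":
    gpu = System(peak_flops=1e15, mem_bw=3e12, net_bw=4.5e11)  # hypothetical accelerator
    t = dense_layer_estimate(b=8, s=4096, h=8192, tp=8, sys=gpu)
    print(f"estimated layer time: {t * 1e3:.3f} ms")
```

A full framework of the kind the paper describes would extend this skeleton with flash-attention memory traffic, expert routing for mixture-of-experts layers, topology-specific collective costs across the 5D parallelism dimensions, and a per-chiplet cost term, then sweep system configurations to expose performance-cost trade-offs.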
Similar Papers
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Computation and Language
Makes AI smarter and faster to use.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI talk faster by fixing how computers share info.
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Distributed, Parallel, and Cluster Computing
Predicts computer learning time without needing supercomputers.