Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
By: Guoliang He, Youhe Jiang, Wencong Xiao, and more
Potential Business Impact:
Makes AI learn faster on many computers.
The scaling law for large language models (LLMs) indicates that the path toward machine intelligence requires training at large scale. Companies therefore continuously build large-scale GPU clusters and launch training jobs that span thousands of computing nodes. However, LLM pre-training presents unique challenges due to its complex communication patterns, where GPUs exchange data in sparse yet high-volume bursts within specific groups. Inefficient resource scheduling exacerbates bandwidth contention, leading to suboptimal training performance. This paper presents Arnold, a scheduling system that distills our experience in aligning LLM communication patterns with data center topology at scale. We perform an in-depth characterization study to identify the impact of physical network topology on LLM pre-training jobs. Based on these insights, we develop a scheduling algorithm that effectively aligns communication patterns with the physical network topology of modern data centers. Through simulation experiments, we show that our algorithm reduces the maximum spread of communication groups by up to $1.67\times$. In production training, our scheduling system improves end-to-end performance by $10.6\%$ when training with more than $9600$ GPUs, a significant improvement for our training pipeline.
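To make the "spread" idea concrete, below is a minimal Python sketch of topology-aware placement in the spirit the abstract describes: it measures a communication group's spread as the number of distinct switches its nodes land under, and greedily packs the largest groups first so each group crosses as few switches as possible. All names here (`group_spread`, `place_groups`, the capacity model) are illustrative assumptions, not Arnold's actual algorithm or API.

```python
# Hypothetical sketch, not the paper's algorithm: place nodes of
# communication groups onto switches to keep each group's spread
# (number of distinct switches it touches) small.
from typing import Dict, List


def group_spread(assignment: Dict[int, int], group: List[int]) -> int:
    """Spread of a group = number of distinct switches hosting its nodes."""
    return len({assignment[node] for node in group})


def place_groups(groups: List[List[int]],
                 switch_capacity: Dict[int, int]) -> Dict[int, int]:
    """Greedy placement: handle the largest groups first, packing each
    group's unassigned nodes onto the switches with the most free slots
    so the group crosses as few switches as possible."""
    assignment: Dict[int, int] = {}
    free = dict(switch_capacity)  # remaining node slots per switch
    for group in sorted(groups, key=len, reverse=True):
        pending = [n for n in group if n not in assignment]
        # Prefer switches with the most remaining capacity, to keep
        # the whole group together where possible.
        for switch in sorted(free, key=free.get, reverse=True):
            while pending and free[switch] > 0:
                assignment[pending.pop()] = switch
                free[switch] -= 1
            if not pending:
                break
        if pending:
            raise RuntimeError("cluster capacity exhausted")
    return assignment


if __name__ == "__main__":
    # Two disjoint groups of four nodes plus one small cross-group,
    # over three switches with four slots each.
    groups = [[0, 1, 2, 3], [4, 5, 6, 7], [0, 4]]
    capacity = {0: 4, 1: 4, 2: 4}
    placement = place_groups(groups, capacity)
    for g in groups:
        print(g, "spread =", group_spread(placement, g))
```

In this toy run the two large groups each fit under a single switch (spread 1), while the small cross-group necessarily spans two; a production scheduler would additionally weigh traffic volume per group and the multi-tier topology of the data center.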
Similar Papers
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI talk faster by fixing how computers share info.
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Distributed, Parallel, and Cluster Computing
Makes AI learn faster on many computers.
System-performance and cost modeling of Large Language Model training and inference
Hardware Architecture
Makes big AI models train and run cheaper.