Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation

Published: April 24, 2025 | arXiv ID: 2504.17672v1

By: Ying Zhu, Yang Xu, Hongli Xu, and more

Potential Business Impact:

Makes AI models train faster across far-apart data centers.

Business Areas:
Content Delivery Network; Content and Publishing

Training large language models (LLMs) requires massive computational resources, often necessitating the aggregation of geographically distributed data centers (i.e., cross-region training). However, the high communication latency in wide-area networks severely degrades the efficiency of traditional distributed training. While methods like DiLoCo reduce communication frequency, they suffer from blocking synchronization. Streaming DiLoCo alleviates this issue via communication-computation overlapping, but it introduces update staleness and model inconsistency due to delayed global updates and partial synchronization. These factors impair convergence, especially when aggressive overlap is needed to mask high latency. We propose CoCoDC, a novel distributed training framework with communication-computation overlapping and delay compensation, to explicitly tackle these challenges. Within CoCoDC, we develop a Delay Compensation strategy based on Taylor expansion to effectively mitigate staleness, as well as an Adaptive Transmission strategy that dynamically schedules model-fragment synchronization to optimize bandwidth usage and accelerate convergence. Extensive experiments highlight the superior performance of CoCoDC over both DiLoCo and Streaming DiLoCo in terms of final accuracy and training speed. Specifically, CoCoDC reduces the number of training steps needed to reach a comparable perplexity by up to 21.0% compared to Streaming DiLoCo. Our work provides an effective solution for scalable and efficient cross-region LLM training.
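The abstract names a Taylor-expansion-based delay compensation but does not give the formula. Below is a minimal sketch, assuming a first-order correction in the style of DC-ASGD (Zheng et al., 2017), where the Hessian is approximated element-wise by the squared update; the function name `compensate`, the hyperparameter `lam`, and the toy tensors are illustrative assumptions, not the paper's implementation.

```python
import torch

def compensate(delta, w_now, w_snap, lam=0.5):
    # First-order Taylor correction: the stale global update `delta` was
    # computed against the snapshot weights `w_snap`; approximate what it
    # would have been at the current weights `w_now` (after overlapped
    # local steps). The Hessian is replaced by the element-wise surrogate
    # delta * delta, as in DC-ASGD-style delay compensation (assumption).
    return delta + lam * delta * delta * (w_now - w_snap)

# Toy usage: dummy tensors stand in for one model fragment.
torch.manual_seed(0)
w_snap = torch.randn(4)                   # weights when the sync round began
delta = torch.randn(4) * 0.1              # stale global update received later
w_now = w_snap + torch.randn(4) * 0.05    # weights after overlapped local steps

w_now = w_now - 0.7 * compensate(delta, w_now, w_snap)  # 0.7: outer step size
```

The intuition is that overlapping communication with computation lets the local weights drift while the global update is in flight; the correction term nudges the delayed update toward the value it would have taken at the weights where it is actually applied.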

Country of Origin
🇨🇳 China

Page Count
7 pages

Category
Computer Science:
Distributed, Parallel, and Cluster Computing