GeoPipe: a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network

Published: October 14, 2025 | arXiv ID: 2510.12064v1

By: Jun Dai, Xiaorun Wang, Kexiong Fang, and more

Potential Business Impact:

Trains giant AI models across many data centers.

Business Areas:
Content Delivery Network; Content and Publishing

The proliferation of Large Language Models (LLMs) with exponentially growing parameter counts is making cross-data-center (DC) training an inevitable trend. However, viable strategies for extending single-DC training frameworks to multi-DC environments remain underdeveloped. We experimentally demonstrate, for the first time, a high-performance geo-distributed LLM training framework spanning multiple DCs interconnected by a lossless, remote direct memory access (RDMA) enabled Datacenter Optical Transport Network (DC-OTN). An enhanced pipeline parallelism scheme is implemented within Huawei's Ascend full-stack environment, effectively eliminating the impact of cross-DC communication overhead on training efficiency. Computation is overlapped with cross-DC communication under constrained cross-DC bandwidth and High Bandwidth Memory (HBM) capacity, reducing the computation bubble ratio by up to 78.91%.
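As a rough illustration of the overlap idea in the abstract, the sketch below interleaves a pipeline stage's forward computation on one micro-batch with the asynchronous cross-DC send of the previous micro-batch's activations. It is a minimal sketch assuming a PyTorch-like distributed API (the ranks, micro-batch loop, and dist.isend usage are illustrative assumptions); the paper's actual scheme is implemented on Huawei's Ascend full stack and is not reproduced here.

```python
# Minimal sketch (not the paper's implementation): overlap local pipeline-stage
# compute with asynchronous cross-DC transfer of activations to the next stage.
import torch
import torch.distributed as dist


def pipeline_stage_step(model, microbatches, next_rank):
    """Run one pipeline stage over a list of micro-batches, overlapping the
    slow cross-DC send of each output with computation of the next micro-batch.

    Assumes dist.init_process_group() has already been called and that
    `next_rank` hosts the following pipeline stage in a remote data center.
    """
    pending_send = None
    for x in microbatches:
        y = model(x)                         # compute on local HBM
        if pending_send is not None:
            pending_send.wait()              # ensure previous cross-DC transfer finished
        # Launch a non-blocking point-to-point send of the activations; the
        # transfer proceeds over the inter-DC link while the next micro-batch
        # is computed locally, shrinking the pipeline bubble.
        pending_send = dist.isend(y.detach(), dst=next_rank)
    if pending_send is not None:
        pending_send.wait()                  # drain the last in-flight transfer
```

The point of the sketch is only the scheduling pattern: the cross-DC transfer of one micro-batch runs concurrently with local computation of the next, which is what reduces the computation bubble under limited inter-DC bandwidth.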

Country of Origin
🇨🇳 China

Page Count
6 pages

Category
Computer Science:
Networking and Internet Architecture