GeoPipe: A Geo-Distributed LLM Training Framework with Enhanced Pipeline Parallelism in a Lossless RDMA-Enabled Datacenter Optical Transport Network
By: Jun Dai, Xiaorun Wang, Kexiong Fang, and more
Potential Business Impact:
Enables training of giant AI models across multiple, geographically separated data centers.
The proliferation of Large Language Models (LLMs) with exponentially growing parameter counts is making cross-data-center (DC) training an inevitable trend. However, viable strategies for extending single-DC training frameworks to multi-DC environments remain underdeveloped. We experimentally demonstrate, for the first time, a high-performance geo-distributed LLM training framework across multiple DCs interconnected by a lossless, remote direct memory access (RDMA)-enabled Datacenter Optical Transport Network (DC-OTN). An enhanced pipeline parallelism scheme is implemented within Huawei's Ascend full-stack environment, effectively eliminating the impact of cross-DC communication overhead on training efficiency. Computation is overlapped with cross-DC communication under constrained cross-DC bandwidth and High Bandwidth Memory (HBM) capacity, reducing the computation bubble ratio by up to 78.91%.
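The reported bubble-ratio reduction comes from hiding cross-DC transfer time behind microbatch compute rather than letting it stall the pipeline. As a rough illustration of that arithmetic, the Python sketch below models a 1F1B-style pipeline schedule with and without communication overlap; the schedule model, function names, and timing values are illustrative assumptions, not the paper's implementation or measured numbers.

```python
# Illustrative bubble-ratio model for a 1F1B pipeline schedule. All names
# and timing parameters are hypothetical; this is not GeoPipe's code.

def bubble_ratio(num_stages: int, num_microbatches: int,
                 t_fwd: float, t_bwd: float,
                 t_comm: float, overlap: bool) -> float:
    """Fraction of an iteration each stage spends idle (the 'bubble').

    In a 1F1B schedule the warm-up/cool-down bubble per stage is
    (num_stages - 1) * (t_fwd + t_bwd). If cross-DC activation transfers
    are not overlapped with compute, each microbatch hand-off across the
    DC boundary also adds t_comm to the critical path.
    """
    compute = num_microbatches * (t_fwd + t_bwd)
    bubble = (num_stages - 1) * (t_fwd + t_bwd)
    if not overlap:
        # Non-overlapped: every cross-DC hand-off stalls the pipeline.
        bubble += num_microbatches * t_comm
    # Overlapped: t_comm is hidden behind the next microbatch's compute,
    # assuming t_comm <= t_fwd + t_bwd (taken for granted in this model).
    total = compute + bubble
    return bubble / total


if __name__ == "__main__":
    # Hypothetical setup: 4 pipeline stages split across 2 DCs,
    # 16 microbatches, cross-DC transfer comparable to one forward pass.
    naive = bubble_ratio(4, 16, t_fwd=1.0, t_bwd=2.0, t_comm=1.0, overlap=False)
    fused = bubble_ratio(4, 16, t_fwd=1.0, t_bwd=2.0, t_comm=1.0, overlap=True)
    print(f"bubble ratio without overlap: {naive:.2%}")
    print(f"bubble ratio with overlap:    {fused:.2%}")
    print(f"reduction: {(1 - fused / naive):.2%}")
```

With these made-up timings the overlap roughly halves the bubble ratio; the actual reduction depends on how much of the cross-DC transfer time fits under per-microbatch compute, which is what the constrained-bandwidth design in the paper addresses.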
Similar Papers
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
Distributed, Parallel, and Cluster Computing
Optimizes pipeline schedules to train large models faster across data centers.
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
Machine Learning (CS)
Uses topology-aware weight pipelining to accelerate long-context model training.
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
Systems and Control
Adapts pipeline segments for efficient collaborative LLM training on heterogeneous edge devices.