GeoPipe: A Geo-Distributed LLM Training Framework with Enhanced Pipeline Parallelism in a Lossless RDMA-Enabled Datacenter Optical Transport Network
By: Jun Dai, Xiaorun Wang, Kexiong Fang, and more
Potential Business Impact:
Enables training of giant AI models across multiple, geographically separated data centers.
The proliferation of Large Language Models (LLMs) with exponentially growing parameter counts is making cross-data-center (DC) training an inevitable trend. However, viable strategies for extending single-DC training frameworks to multi-DC environments remain underdeveloped. We experimentally demonstrate, for the first time, a high-performance geo-distributed LLM training framework across multiple DCs interconnected by a lossless, remote direct memory access (RDMA)-enabled Datacenter Optical Transport Network (DC-OTN). An enhanced pipeline parallelism scheme is implemented within Huawei's Ascend full-stack environment, effectively eliminating the impact of cross-DC communication overhead on training efficiency. Computation is overlapped with cross-DC communication under constrained cross-DC bandwidth and High Bandwidth Memory (HBM) capacity, reducing the computation bubble ratio by up to 78.91%.
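The reported bubble-ratio reduction comes from hiding cross-DC transfer time behind microbatch compute rather than letting it stall the pipeline. As a rough illustration of that arithmetic, the Python sketch below models a 1F1B-style pipeline schedule with and without communication overlap; the schedule model, function names, and timing values are illustrative assumptions, not the paper's implementation or measured numbers.

```python
# Illustrative bubble-ratio model for a 1F1B pipeline schedule. All names
# and timing parameters are hypothetical; this is not GeoPipe's code.

def bubble_ratio(num_stages: int, num_microbatches: int,
                 t_fwd: float, t_bwd: float,
                 t_comm: float, overlap: bool) -> float:
    """Fraction of an iteration each stage spends idle (the 'bubble').

    In a 1F1B schedule the warm-up/cool-down bubble per stage is
    (num_stages - 1) * (t_fwd + t_bwd). If cross-DC activation transfers
    are not overlapped with compute, each microbatch hand-off across the
    DC boundary also adds t_comm to the critical path.
    """
    compute = num_microbatches * (t_fwd + t_bwd)
    bubble = (num_stages - 1) * (t_fwd + t_bwd)
    if not overlap:
        # Non-overlapped: every cross-DC hand-off stalls the pipeline.
        bubble += num_microbatches * t_comm
    # Overlapped: t_comm is hidden behind the next microbatch's compute,
    # assuming t_comm <= t_fwd + t_bwd (taken for granted in this model).
    total = compute + bubble
    return bubble / total


if __name__ == "__main__":
    # Hypothetical setup: 4 pipeline stages split across 2 DCs,
    # 16 microbatches, cross-DC transfer comparable to one forward pass.
    naive = bubble_ratio(4, 16, t_fwd=1.0, t_bwd=2.0, t_comm=1.0, overlap=False)
    fused = bubble_ratio(4, 16, t_fwd=1.0, t_bwd=2.0, t_comm=1.0, overlap=True)
    print(f"bubble ratio without overlap: {naive:.2%}")
    print(f"bubble ratio with overlap:    {fused:.2%}")
    print(f"reduction: {(1 - fused / naive):.2%}")
```

With these made-up timings the overlap roughly halves the bubble ratio; the actual reduction depends on how much of the cross-DC transfer time fits under per-microbatch compute, which is what the constrained-bandwidth design in the paper addresses.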
Similar Papers
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
Distributed, Parallel, and Cluster Computing
Optimizes pipeline schedules to train large models faster across data centers.
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
Machine Learning (CS)
Uses topology-aware weight pipelining to accelerate long-context model training.
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
Systems and Control
Adapts pipeline segments for efficient collaborative LLM training on heterogeneous edge devices.