CBA: Communication-Bound-Aware Cross-Domain Resource Assignment for Pipeline-Parallel Distributed LLM Training in Dynamic Multi-DC Optical Networks
By: Dianxuan Fu, Xiaomin Liu, Yihao Zhang and more
We propose a communication-bound-aware cross-domain resource assignment framework for pipeline-parallel distributed training over multi-datacenter optical networks, which lowers iteration time by 31.25% and reduces blocked requests by 13.20% compared to baselines.
Similar Papers
First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution
Multiagent Systems
AI agents learn faster to control networks.
GeoPipe: a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network
Networking and Internet Architecture
Trains giant AI models across many computer centers.
Communication-Computation Pipeline Parallel Split Learning over Wireless Edge Networks
Distributed, Parallel, and Cluster Computing
Speeds up AI learning by sharing tasks smartly.