RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
By: Yinxiao Feng, Tiancheng Chen, Yuchen Wei, and more
Potential Business Impact:
Connects many computer chips cheaply for AI.
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architectures are neither scalable nor cost-effective enough. Tree-based topologies such as the \textit{Rail-optimized} network are extremely expensive, while direct topologies such as the \textit{Torus} have insufficient bisection bandwidth and flexibility. In this paper, we propose \textit{RailX}, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically organized in 2D, achieving better scalability than existing centralized circuit-switching networks. We propose a novel interconnection method based on \textit{Hamiltonian Decomposition} theory to organize separate rail-based rings into an \textit{all-to-all} topology, simultaneously optimizing ring-collective and all-to-all communication. More than $100$K chips with hyper bandwidth can be interconnected through a flat switching layer, and the network diameter is only $2\sim4$ inter-node hops. The network cost per unit of injection/All-Reduce bandwidth of \textit{RailX} is less than $10\%$ of that of the Fat-Tree, and the cost per unit of bisection/All-to-All bandwidth is less than $50\%$ of that of the Fat-Tree. Specifically, only $\sim$\$$1.3$B is required to interconnect 200K chips with 1.8 TB/s bandwidth. \textit{RailX} can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped and failures can be worked around.
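To make the Hamiltonian Decomposition idea concrete, the sketch below is a minimal Python illustration, not the paper's actual construction: it uses the classic Walecki-style decomposition of the complete graph K_n (n odd) into (n-1)/2 edge-disjoint Hamiltonian rings, so that each ring could stand in for one rail-based ring and the union of all rings yields the all-to-all link set. The node count, labels, and helper names (walecki_rings, ring_edges) are illustrative assumptions.

from itertools import combinations

def walecki_rings(num_nodes):
    # Decompose K_n (n odd) into (n-1)//2 edge-disjoint Hamiltonian rings
    # using the classic zigzag (Walecki-style) construction.
    assert num_nodes % 2 == 1, "this classic construction assumes an odd node count"
    m = (num_nodes - 1) // 2          # number of rings
    hub = num_nodes - 1               # distinguished hub node
    # zigzag offsets: 0, +1, -1, +2, -2, ..., +(m-1), -(m-1), +m
    offsets = [0]
    for k in range(1, m):
        offsets += [k, -k]
    offsets.append(m)
    rings = []
    for i in range(m):
        rings.append([hub] + [(i + off) % (num_nodes - 1) for off in offsets])
    return rings

def ring_edges(ring):
    # Undirected links of one ring, including the wrap-around link.
    return {frozenset((ring[j], ring[(j + 1) % len(ring)])) for j in range(len(ring))}

if __name__ == "__main__":
    n = 9                             # e.g. 9 nodes -> 4 edge-disjoint rings
    rings = walecki_rings(n)
    covered = set()
    for r in rings:
        e = ring_edges(r)
        assert not (covered & e), "rings must be edge-disjoint"
        covered |= e
    assert covered == {frozenset(p) for p in combinations(range(n), 2)}
    print(f"{len(rings)} rings cover all {len(covered)} links of K_{n}")

For an even node count, the complete graph instead decomposes into n/2 - 1 Hamiltonian rings plus a perfect matching; the sketch sticks to the odd case for simplicity, and how RailX's circuit switches realize and reconfigure such ring sets is detailed in the paper itself.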
Similar Papers
Compute Can't Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure
Distributed, Parallel, and Cluster Computing
Builds super-fast AI by connecting computer parts better.
Photonic Rails in ML Datacenters
Networking and Internet Architecture
Makes computer training faster and cheaper.
Scalable and Efficient Intra- and Inter-node Interconnection Networks for Post-Exascale Supercomputers and Data centers
Hardware Architecture
Makes supercomputers faster by fixing data jams.