RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
By: Yinxiao Feng, Tiancheng Chen, Yuchen Wei, and more
Potential Business Impact:
Connects many computer chips cheaply for AI.
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architectures are neither scalable nor cost-effective enough. Tree-based topologies such as the \textit{Rail-optimized} network are extremely expensive, while direct topologies such as the \textit{Torus} have insufficient bisection bandwidth and flexibility. In this paper, we propose \textit{RailX}, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically organized in 2D, achieving better scalability than existing centralized circuit-switching networks. We propose a novel interconnection method based on \textit{Hamiltonian Decomposition} theory to organize separate rail-based rings into an \textit{all-to-all} topology, simultaneously optimizing ring-collective and all-to-all communication. More than $100$K chips with hyper bandwidth can be interconnected through a flat switching layer, and the network diameter is only $2\sim4$ inter-node hops. The network cost per unit of injection/All-Reduce bandwidth of \textit{RailX} is less than $10\%$ of that of the Fat-Tree, and the cost per unit of bisection/All-to-All bandwidth is less than $50\%$ of that of the Fat-Tree. Specifically, only $\sim$\$$1.3$B is required to interconnect 200K chips with 1.8 TB/s bandwidth. \textit{RailX} can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped and failures can be worked around.
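To make the Hamiltonian Decomposition idea concrete, the sketch below is a minimal Python illustration, not the paper's actual construction: it uses the classic Walecki-style decomposition of the complete graph K_n (n odd) into (n-1)/2 edge-disjoint Hamiltonian rings, so that each ring could stand in for one rail-based ring and the union of all rings yields the all-to-all link set. The node count, labels, and helper names (walecki_rings, ring_edges) are illustrative assumptions.

from itertools import combinations

def walecki_rings(num_nodes):
    # Decompose K_n (n odd) into (n-1)//2 edge-disjoint Hamiltonian rings
    # using the classic zigzag (Walecki-style) construction.
    assert num_nodes % 2 == 1, "this classic construction assumes an odd node count"
    m = (num_nodes - 1) // 2          # number of rings
    hub = num_nodes - 1               # distinguished hub node
    # zigzag offsets: 0, +1, -1, +2, -2, ..., +(m-1), -(m-1), +m
    offsets = [0]
    for k in range(1, m):
        offsets += [k, -k]
    offsets.append(m)
    rings = []
    for i in range(m):
        rings.append([hub] + [(i + off) % (num_nodes - 1) for off in offsets])
    return rings

def ring_edges(ring):
    # Undirected links of one ring, including the wrap-around link.
    return {frozenset((ring[j], ring[(j + 1) % len(ring)])) for j in range(len(ring))}

if __name__ == "__main__":
    n = 9                             # e.g. 9 nodes -> 4 edge-disjoint rings
    rings = walecki_rings(n)
    covered = set()
    for r in rings:
        e = ring_edges(r)
        assert not (covered & e), "rings must be edge-disjoint"
        covered |= e
    assert covered == {frozenset(p) for p in combinations(range(n), 2)}
    print(f"{len(rings)} rings cover all {len(covered)} links of K_{n}")

For an even node count, the complete graph instead decomposes into n/2 - 1 Hamiltonian rings plus a perfect matching; the sketch sticks to the odd case for simplicity, and how RailX's circuit switches realize and reconfigure such ring sets is detailed in the paper itself.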
Similar Papers
Compute Can't Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure
Distributed, Parallel, and Cluster Computing
Builds super-fast AI by connecting computer parts better.
Photonic Rails in ML Datacenters
Networking and Internet Architecture
Makes computer training faster and cheaper.
Scalable and Efficient Intra- and Inter-node Interconnection Networks for Post-Exascale Supercomputers and Data centers
Hardware Architecture
Makes supercomputers faster by fixing data jams.