Photonic Rails in ML Datacenters
By: Eric Ding, Chuhan Ouyang, Rachee Singh
Potential Business Impact:
Makes training large ML models faster and cheaper.
Rail-optimized fabrics have become the de facto scale-out network for large-scale ML training in datacenters. However, the use of high-radix electrical switches to provide all-to-all connectivity within each rail imposes massive power, cost, and complexity overheads. We propose rethinking the rail abstraction: retain its communication semantics, but realize it using optical circuit switches. The key challenge is that optical switches support only one-to-one connectivity at a time, limiting the fan-out of traffic in ML workloads that use hybrid parallelisms. We introduce parallelism-driven rail reconfiguration, which leverages the sequential ordering of traffic from different parallelism dimensions. We design a control plane, Opus, to enable time-multiplexed emulation of electrical rail switches using optical switches. More broadly, our work outlines a new research agenda: datacenter fabrics that co-evolve with the parallelism dimensions within each job, as opposed to the prevailing practice of reconfiguring the network only before a job begins.
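To make the idea of parallelism-driven rail reconfiguration concrete, the sketch below shows how a rail of GPUs attached to a single optical circuit switch (OCS) could be time-multiplexed across the communication phases of one training step. This is a minimal illustration, not the Opus design: names such as Phase, RailScheduler, and install_matching are hypothetical, and it assumes one OCS per rail, a fixed reconfiguration latency, and collectives that decompose into sequences of one-to-one matchings.

```python
# Minimal sketch: time-multiplexing one optical circuit switch (OCS) so a
# rail of GPUs sees the connectivity an electrical rail switch would give,
# one parallelism phase at a time. All names are illustrative.

from dataclasses import dataclass
from typing import Dict, List, Tuple

Matching = Dict[int, int]  # OCS port -> OCS port, one-to-one


@dataclass
class Phase:
    """One communication phase of a training step (e.g., PP send, DP all-reduce)."""
    name: str
    matchings: List[Matching]  # circuits needed, applied in sequence


def ring_allreduce_matchings(ranks: List[int]) -> List[Matching]:
    """A ring all-reduce only talks to the next neighbor, so one
    rotate-by-one matching held for the whole collective suffices."""
    n = len(ranks)
    return [{ranks[i]: ranks[(i + 1) % n] for i in range(n)}]


def pipeline_matchings(stage_pairs: List[Tuple[int, int]]) -> List[Matching]:
    """Point-to-point transfers between adjacent pipeline stages map
    directly onto a single matching."""
    return [dict(stage_pairs)]


class RailScheduler:
    """Reconfigures the OCS between phases, in the order the hybrid
    parallelism already generates traffic."""

    def __init__(self, ocs_reconfig_us: float = 10.0):
        self.ocs_reconfig_us = ocs_reconfig_us  # assumed switching latency
        self.current: Matching = {}

    def install_matching(self, m: Matching) -> float:
        """Pretend to program the OCS; return reconfiguration time spent."""
        cost = 0.0 if m == self.current else self.ocs_reconfig_us
        self.current = m
        return cost

    def run_step(self, phases: List[Phase]) -> float:
        """Walk the phases sequentially, reconfiguring only when the next
        phase needs a different set of circuits."""
        overhead = 0.0
        for phase in phases:
            for m in phase.matchings:
                overhead += self.install_matching(m)
                # ... traffic for this phase flows over the circuits in m ...
        return overhead


if __name__ == "__main__":
    dp_ranks = [0, 1, 2, 3, 4, 5, 6, 7]          # data-parallel peers on this rail
    pp_pairs = [(0, 4), (1, 5), (2, 6), (3, 7)]  # adjacent pipeline stages

    step = [
        Phase("pp_forward", pipeline_matchings(pp_pairs)),
        Phase("pp_backward", pipeline_matchings([(b, a) for a, b in pp_pairs])),
        Phase("dp_allreduce", ring_allreduce_matchings(dp_ranks)),
    ]
    us = RailScheduler().run_step(step)
    print(f"reconfiguration overhead per step: {us:.1f} us")
```

The sketch relies on the observation the abstract makes: because traffic from different parallelism dimensions is sequentially ordered within a step, each phase needs only a small set of one-to-one circuits at a time, so a circuit switch can stand in for an all-to-all electrical switch without serving every pair simultaneously.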
Similar Papers
Chip-to-chip photonic connectivity in multi-accelerator servers for ML
Networking and Internet Architecture
Makes AI learn much faster on many computers.
Panel-Scale Reconfigurable Photonic Interconnects for Scalable AI Computation
Systems and Control
Connects computer parts with light, faster and cheaper.
RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
Hardware Architecture
Connects many computer chips cheaply for AI.