Reimagining RDMA Through the Lens of ML
By: Ertza Warraich, Ali Imran, Annus Zulfiqar and more
Potential Business Impact:
Makes AI training much faster and more reliable.
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, such as RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms add complexity and latency that scale poorly: even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees in light of ML's tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling a best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3x, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults, delivering a resilient, scalable transport tailored for ML at cluster scale.
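The abstract names the Hadamard Transform as one way to move loss recovery into the ML pipeline. The sketch below illustrates the general idea only and is not Celeris's implementation: a gradient is rotated with an orthonormal Walsh-Hadamard transform before sending, so that a lost chunk shows up as a small error spread over every coordinate instead of wiping out one coordinate entirely. The function names (`fwht`, `encode`, `decode`, `drop`), the random sign flip, and the convention of treating a lost chunk as zero are all assumptions made for this illustration.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse).

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def encode(gradient, signs):
    """Sender side (hypothetical): random sign flip, then rotate."""
    return fwht([g * s for g, s in zip(gradient, signs)])


def decode(received, signs):
    """Receiver side (hypothetical): rotate back, then undo the sign flip."""
    return [x * s for x, s in zip(fwht(received), signs)]


def drop(chunks, lost_indices):
    """Model best-effort delivery: lost chunks arrive as zeros (assumption)."""
    lost = set(lost_indices)
    return [0.0 if i in lost else x for i, x in enumerate(chunks)]


rng = random.Random(0)
n = 8
gradient = [4.0] + [0.0] * (n - 1)  # all the signal sits in one coordinate
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
lost = [0]  # the first chunk never arrives and is not retransmitted

plain = drop(gradient, lost)
recovered = decode(drop(encode(gradient, signs), lost), signs)

plain_err = max(abs(g - p) for g, p in zip(gradient, plain))
recovered_err = max(abs(g - r) for g, r in zip(gradient, recovered))
print("max error without transform:", plain_err)
print("max error with transform:   ", recovered_err)
```

In this toy case, losing the first chunk without the transform erases the only nonzero coordinate (maximum error 4.0). With the transform, the same loss leaves every coordinate off by 0.5, and the first coordinate is recovered as 3.5. Because the transform is orthonormal, it preserves the size of the error, so its benefit comes from spreading a loss evenly across coordinates rather than from shrinking what was dropped.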
Similar Papers
An Extensible Software Transport Layer for GPU Networking
Networking and Internet Architecture
Makes AI training much faster by fixing network problems.
RDMA Point-to-Point Communication for LLM Systems
Distributed, Parallel, and Cluster Computing
Makes AI models train and run faster.
DMA Collectives for Efficient ML Communication Offloads
Distributed, Parallel, and Cluster Computing
Makes AI learn faster and use less power.