OptiNIC: A Resilient and Tail-Optimal RDMA NIC for Distributed ML Workloads
By: Ertza Warraich, Ali Imran, Annus Zulfiqar, and more
Potential Business Impact:
Speeds up AI training by fixing slow computer messages.
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by high-speed interconnects, tail latency in collective communication has become a major bottleneck. Existing RDMA transports, such as RoCE, IRN, SRNIC, and Falcon, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While these approaches work well for general-purpose workloads, they introduce complexity and latency that scale poorly in ML, where even rare packet delays can stall entire model pipelines. We present OptiNIC, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for partial or missing data. OptiNIC eliminates retransmissions and in-order delivery from the NIC, enabling a best-effort, out-of-order transport model for RDMA. Unlike traditional RDMA, which signals completion only after complete data delivery, OptiNIC introduces adaptive timeouts to trigger forward progress when data may be lost or delayed. OptiNIC retains standard congestion control mechanisms (e.g., DCQCN, EQDS, or Swift) while shifting loss recovery to the ML pipeline itself (e.g., via the Hadamard Transform and Erasure Coding). Our evaluation shows that OptiNIC improves time-to-accuracy (TTA) by 2x for training and increases throughput by 1.6x for inference across two public clouds (i.e., Hyperstack and CloudLab). OptiNIC also lowers 99th-percentile latency by 3.5x, cuts BRAM usage by 2.7x, and nearly doubles NIC resilience to faults, delivering a resilient, tail-optimized RDMA transport purpose-built for distributed ML workloads.
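To make the abstract's transport model concrete, here is a minimal Python sketch (not the OptiNIC implementation, whose details are not given here) of the two ideas it names: a receiver that completes after a timeout rather than waiting for retransmissions, and application-level recovery of a lost chunk via a simple XOR erasure code. The fixed `timeout_s` stands in for OptiNIC's adaptive timeout; all names are hypothetical.

```python
import time

def xor_bytes(a, b):
    # Bytewise XOR of two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chunks):
    # XOR parity over equal-length data chunks; tolerates one lost chunk.
    parity = bytes(len(chunks[0]))
    for c in chunks:
        parity = xor_bytes(parity, c)
    return parity

def receive_best_effort(arrivals, num_chunks, parity, timeout_s=0.01):
    # Poll for chunks, which may arrive out of order. On timeout, signal
    # completion anyway and reconstruct at most one missing chunk from the
    # parity instead of requesting a NIC-level retransmission.
    received = {}
    deadline = time.monotonic() + timeout_s
    while len(received) < num_chunks and time.monotonic() < deadline:
        if arrivals:
            idx, data = arrivals.pop(0)
            received[idx] = data
        else:
            time.sleep(0.001)
    missing = [i for i in range(num_chunks) if i not in received]
    if len(missing) == 1:
        # Recover the single lost chunk: parity XOR all received chunks.
        rec = parity
        for c in received.values():
            rec = xor_bytes(rec, c)
        received[missing[0]] = rec
    return [received[i] for i in sorted(received)]
```

For example, if chunk 1 of `[b"aaaa", b"bbbb", b"cccc"]` is dropped and the rest arrive out of order, the receiver still returns all three chunks after the timeout, with chunk 1 rebuilt from the parity.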
Similar Papers
Reimagining RDMA Through the Lens of ML
Distributed, Parallel, and Cluster Computing
Makes AI training much faster and more reliable.
An Extensible Software Transport Layer for GPU Networking
Networking and Internet Architecture
Makes AI training much faster by fixing network problems.
RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs
Hardware Architecture
Makes computer networks faster for AI.