RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs
By: Maximilian Jakob Heer, Benjamin Ramhorst, Yu Zhu, and more
Potential Business Impact:
Makes computer networks faster for AI.
Data-intensive applications in data centers, especially machine learning (ML), have made the network a bottleneck, which in turn has motivated the development of more efficient network protocols and infrastructure. For instance, remote direct memory access (RDMA) has become the standard protocol for data transport in the cloud because it minimizes data copies and reduces CPU utilization via host bypassing. Similarly, a growing share of network functions and infrastructure has moved to accelerators, SmartNICs, and in-network computing to bypass the CPU. In this paper we explore the implementation and deployment of RoCE BALBOA, an open-source, RoCE v2-compatible, 100G-capable RDMA stack that scales to hundreds of queue pairs and can serve as the basis for building accelerators and SmartNICs. RoCE BALBOA is customizable, opening up a design space and offering a degree of adaptability not available in commercial products. We have deployed BALBOA in an FPGA cluster and show that its latency and performance characteristics are comparable to those of commercial NICs. We demonstrate its potential through two classes of use cases. The first enhances the protocol for infrastructure purposes (encryption, deep packet inspection using ML). The second showcases line-rate compute offloads with deep pipelines, implementing commercial data preprocessing pipelines for recommender systems that process data as it arrives from the network before transferring it directly to the GPU. These examples demonstrate how BALBOA enables the exploration and development of SmartNICs and accelerators operating on network data streams.
Similar Papers
Network-accelerated Active Messages
Networking and Internet Architecture
Moves computer work to network cards.
An RDMA-First Object Storage System with SmartNIC Offload
Hardware Architecture
Makes AI learn much faster by speeding up data access.
Reimagining RDMA Through the Lens of ML
Distributed, Parallel, and Cluster Computing
Makes AI training much faster and more reliable.