gpu_ext: Extensible OS Policies for GPUs via eBPF
By: Yusheng Zheng, Tong Yu, Yiwei Yang and more
Performance in modern GPU-centric systems increasingly depends on resource management policies for memory placement, scheduling, and observability. However, a one-size-fits-all policy performs poorly across diverse workloads. Existing approaches present a tradeoff: user-space runtimes offer programmability but lack cross-tenant visibility and fine-grained hardware control, while OS kernel modifications introduce complexity and safety risks. To address this, we argue that the GPU driver and device layer must serve as an extensible OS policy interface. eBPF is a promising basis for such an interface, but naively transplanting host-side eBPF is insufficient: it cannot observe critical device-side events, and directly injecting policy code into GPU kernels can compromise safety and efficiency. We present gpu_ext, an eBPF-based policy runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers to expose safe hooks and introduces a device-side eBPF runtime that executes verified policy logic within GPU kernels, enabling coherent, application-transparent policies. Evaluation on realistic workloads, including inference, training, and vector search, shows that gpu_ext improves throughput by up to 4.8x and reduces tail latency by up to 2x with low overhead, without modifying applications or restarting drivers.
Similar Papers
eBPF-PATROL: Protective Agent for Threat Recognition and Overreach Limitation using eBPF in Containerized and Virtualized Environments
Cryptography and Security
Stops hackers from breaking into computer clouds.
EPSO: A Caching-Based Efficient Superoptimizer for BPF Bytecode
Software Engineering
Makes computer programs run faster and smaller.
Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure
Distributed, Parallel, and Cluster Computing
Finds computer slowdowns in seconds.