Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure
By: Erfan Darzi, Aldo Pareja, Shreeanant Bharadwaj
Potential Business Impact:
Finds computer slowdowns in seconds.
Diagnosing GPU tail latency spikes in cloud and HPC infrastructure is critical for maintaining performance predictability and resource utilization, yet existing monitoring tools lack the granularity for root cause analysis in shared computing environments. We introduce an eBPF-based telemetry system that provides unified host-side monitoring of GPU workloads, correlating eBPF-derived host metrics with GPU-internal events for holistic system observability. The system achieves 81--88\% diagnostic accuracy, detects spikes within 5 seconds, and completes root cause analysis in 6--8 seconds, operating with 1.21\% CPU overhead at 100Hz sampling. Evaluated on distributed learning workloads, the system identifies root causes including NIC contention, PCIe pressure, and CPU interference, enabling operational debugging for multi-tenant GPU infrastructure without requiring cluster-wide instrumentation.
Similar Papers
GPU Under Pressure: Estimating Application's Stress via Telemetry and Performance Counters
Distributed, Parallel, and Cluster Computing
Measures computer chip strain to predict failures.
A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces
Distributed, Parallel, and Cluster Computing
Analyzes computer speed problems faster.
Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
Performance
Finds computer problems to speed up work.