Score: 0

Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure

Published: October 19, 2025 | arXiv ID: 2510.16946v1

By: Erfan Darzi, Aldo Pareja, Shreeanant Bharadwaj

Potential Business Impact:

Finds computer slowdowns in seconds.

Business Areas:
Application Performance Management Data and Analytics, Software

Diagnosing GPU tail latency spikes in cloud and HPC infrastructure is critical for maintaining performance predictability and resource utilization, yet existing monitoring tools lack the granularity for root cause analysis in shared computing environments. We introduce an eBPF-based telemetry system that provides unified host-side monitoring of GPU workloads, correlating eBPF-derived host metrics with GPU-internal events for holistic system observability. The system achieves 81--88\% diagnostic accuracy, detects spikes within 5 seconds, and completes root cause analysis in 6--8 seconds, operating with 1.21\% CPU overhead at 100Hz sampling. Evaluated on distributed learning workloads, the system identifies root causes including NIC contention, PCIe pressure, and CPU interference, enabling operational debugging for multi-tenant GPU infrastructure without requiring cluster-wide instrumentation.

Page Count
5 pages

Category
Computer Science:
Distributed, Parallel, and Cluster Computing