Score: 0

Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure

Published: October 19, 2025 | arXiv ID: 2510.16946v1

By: Erfan Darzi, Aldo Pareja, Shreeanant Bharadwaj

Potential Business Impact:

Finds computer slowdowns in seconds.

Business Areas:

Application Performance Management Data and Analytics, Software

Diagnosing GPU tail latency spikes in cloud and HPC infrastructure is critical for maintaining performance predictability and resource utilization, yet existing monitoring tools lack the granularity for root cause analysis in shared computing environments. We introduce an eBPF-based telemetry system that provides unified host-side monitoring of GPU workloads, correlating eBPF-derived host metrics with GPU-internal events for holistic system observability. The system achieves 81--88\% diagnostic accuracy, detects spikes within 5 seconds, and completes root cause analysis in 6--8 seconds, operating with 1.21\% CPU overhead at 100Hz sampling. Evaluated on distributed learning workloads, the system identifies root causes including NIC contention, PCIe pressure, and CPU interference, enabling operational debugging for multi-tenant GPU infrastructure without requiring cluster-wide instrumentation.

GPU Under Pressure: Estimating Application's Stress via Telemetry and Performance Counters

Distributed, Parallel, and Cluster Computing

Measures computer chip strain to predict failures.

7 Nov 2025 0

86%

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Distributed, Parallel, and Cluster Computing

Analyzes computer speed problems faster.

21 Oct 2025 0

86%

Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry

Performance

Finds computer problems to speed up work.

29 Oct 2025 0

View PDF Login to Bookmark

Page Count

5 pages

Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure

Finds computer slowdowns in seconds.

Technical Abstract

GPU Under Pressure: Estimating Application's Stress via Telemetry and Performance Counters

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry