
Privacy-Preserving Performance Profiling of In-The-Wild GPUs

Published: September 26, 2025 | arXiv ID: 2509.21762v1

By: Ian McDougall, Michael Davies, Rahul Chatterjee, and more

Potential Business Impact:

Lets GPU makers see how their chips perform across many users' machines without learning which applications any user runs.

Business Areas:

GPU Hardware

GPUs are the dominant platform for many important applications today, including deep learning, accelerated computing, and scientific simulation. However, as the complexity of both applications and hardware increases, GPU chip manufacturers face a significant challenge: how to gather comprehensive performance characteristics and value profiles from GPUs deployed in real-world scenarios. Such data, encompassing the types of kernels executed and the time spent in each, is crucial for optimizing chip design and enhancing application performance. Unfortunately, despite the availability of low-level tools like NSYS and NCU, current methodologies fall short: they collect data only for an individual user rather than at a broader, more informative fleet-wide scale. This paper takes on the problem of building a system that allows planet-scale, real-time GPU performance profiling of low-level hardware characteristics. The three fundamental problems we solve are: i) user experience, by profiling with no slowdown; ii) user privacy, so that no third party is aware of what applications any user runs; iii) efficacy, by showing we are able to collect data and assign it to applications even when they run on thousands of GPUs. Our results simulate a deployment of 100,000 GPUs running applications from the Torchbench suite, showing that our system addresses all three problems.

GPU Under Pressure: Estimating Application's Stress via Telemetry and Performance Counters

Distributed, Parallel, and Cluster Computing

Measures computer chip strain to predict failures.

7 Nov 2025 0

87%

RTGPU: Real-Time Computing with Graphics Processing Units

Hardware Architecture

Makes computers do hard jobs on time.

8 Jul 2025 0

86%

Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency

Distributed, Parallel, and Cluster Computing

Makes AI training faster and use less power.

9 Dec 2025 1


Page Count

26 pages
