Score: 0

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Published: October 21, 2025 | arXiv ID: 2510.18300v1

By: Ankur Lahiry , Ayush Pokharel , Banooqa Banday and more

Potential Business Impact:

Analyzes computer speed problems faster.

Business Areas:

GPU Hardware

Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance analysis both computationally expensive and time-consuming. To address this challenge, we present an end-to-end parallel performance analysis framework designed to handle multiple large-scale GPU traces efficiently. Our proposed framework partitions and processes trace data concurrently and employs causal graph methods and parallel coordinating chart to expose performance variability and dependencies across execution flows. Experimental results demonstrate a 67% improvement in terms of scalability, highlighting the effectiveness of our pipeline for analyzing multiple traces independently.

Terabyte-Scale Analytics in the Blink of an Eye

Databases

Runs big data jobs 60 times faster.

10 Jun 2025 3

87%

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

Distributed, Parallel, and Cluster Computing

Predicts training time for giant computer jobs.

17 Oct 2025 0

86%

PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production

Distributed, Parallel, and Cluster Computing

Finds computer training problems fast.

10 Jun 2025 1

View PDF Login to Bookmark

Country of Origin

🇺🇸 United States

Page Count

12 pages

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Analyzes computer speed problems faster.

Technical Abstract

Terabyte-Scale Analytics in the Blink of an Eye

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production