A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of identifying performance bottlenecks, high analysis overhead, and unclear attribution of variability in large-scale GPU tracing data within heterogeneous HPC environments, this paper proposes the first end-to-end distributed analytical framework integrating causal graph modeling with parallel coordination graphs. The framework enables concurrent processing of multi-GPU traces and causal inference of performance variability through distributed data partitioning, pipelined parallel computation, scalable causal graph construction, and coordinated visualization. Its core innovation lies in introducing causal inference into GPU performance tracing analysis—enabling, for the first time, cross-trace execution dependency modeling and precise root-cause localization of bottlenecks. Experimental results demonstrate that, compared to baseline methods, the framework improves scalability by 67% in multi-trace independent analysis and significantly accelerates performance bottleneck identification and diagnosis.

Technology Category

Application Category

📝 Abstract
Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance analysis both computationally expensive and time-consuming. To address this challenge, we present an end-to-end parallel performance analysis framework designed to handle multiple large-scale GPU traces efficiently. Our proposed framework partitions and processes trace data concurrently and employs causal graph methods and parallel coordinating chart to expose performance variability and dependencies across execution flows. Experimental results demonstrate a 67% improvement in terms of scalability, highlighting the effectiveness of our pipeline for analyzing multiple traces independently.
Problem

Research questions and friction points this paper is trying to address.

Modeling performance variability in GPU traces
Handling large-scale GPU trace data efficiently
Identifying performance dependencies across execution flows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed framework processes GPU traces concurrently
Employs causal graphs to expose performance variability
Parallel coordinating charts reveal execution flow dependencies
🔎 Similar Papers
No similar papers found.
Ankur Lahiry
Ankur Lahiry
Doctoral Instructional Assistant
Machine learningHigh Performance ComputingCompiler
A
Ayush Pokharel
Texas State University, San Marcos, TX 78666, USA
B
Banooqa Banday
Texas State University, San Marcos, TX 78666, USA
Seth Ockerman
Seth Ockerman
University of Wisconsin Madison
Computer ScienceAISystemsHPC
Amal Gueroudji
Amal Gueroudji
Argonne National Laboratory
HPCdistributed computingIn Situ AnalyticsTask-based Programmingprogramming Models
M
Mohammad Zaeed
Texas State University, San Marcos, TX 78666, USA
T
Tanzima Z. Islam
Texas State University, San Marcos, TX 78666, USA
Line Pouchard
Line Pouchard
Sandia National Laboratories
ProvenanceCurationReproducibilitySemantic Web