Modeling Spatially Correlated Failure-time Data Under Two Distance Functions with an Application to Titan GPU Data

📅 2025-09-05
🤖 AI Summary
GPU failure data can exhibit two kinds of spatial dependence because the physical layout of the GPUs does not match the network topology that connects them. This paper proposes a lifetime regression model with two sets of spatial random effects, one driven by physical distance and one by logical connectivity. The model is fit in a Bayesian hierarchical framework, with posterior inference carried out by Markov chain Monte Carlo (MCMC) sampling in Stan. The main contribution is to bring network topology into the spatial correlation structure of a failure-time model rather than relying on physical distance alone. Simulation studies are reported to show accurate statistical inference, and the framework is applied to large-scale GPU failure-time data from the Titan supercomputer.

📝 Abstract
One common approach to the statistical analysis of spatially correlated data defines a correlation structure that depends only on unknown parameters and the physical distance between the locations of the observed values. However, some data have a complex spatial structure that physical distance alone cannot adequately describe. In this work, the spatial failure-time data of interest describe GPUs connected through a network fabric whose topology differs from their physical layout and is expected to introduce additional correlations. The proposed lifetime regression model includes random effects that capture the dependence due to physical location as well as random effects that explain the dependence due to logical connections between GPUs. The analysis of this GPU dataset serves as an example of models with multiple spatial random effects, and the ideas presented can be extended to other applications with complex spatial structures. A Bayesian modeling scheme is recommended for this class of analyses. The examples in this work use the software package Stan to produce Markov chain Monte Carlo draws for parameter estimation. The modeling effort is validated through simulation, which demonstrates accuracy in statistical inference. We also apply the developed framework to the large-scale Titan GPU failure-time data.
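
The abstract describes the model only in words. One way such a model could be written down is sketched below. This is an illustrative assumption, not the paper's stated formulation: the log-location-scale form, the exponential correlation function, and every symbol ($T_i$, $\mathbf{x}_i$, $\boldsymbol{\beta}$, $u_i$, $v_i$, $\sigma$, $\phi_u$, $\phi_v$, $d^{\text{phys}}_{ij}$, $d^{\text{log}}_{ij}$) are introduced here for illustration.

```latex
% Hypothetical sketch: T_i is the failure time of GPU i, u_i is the random
% effect tied to physical distance, v_i the one tied to logical distance.
\begin{align*}
\log T_i &= \mathbf{x}_i^\top \boldsymbol{\beta} + u_i + v_i + \sigma\,\varepsilon_i, \\
\mathbf{u} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_u^2\, \mathbf{R}_u\right), \quad
[\mathbf{R}_u]_{ij} = \exp\!\left(-d^{\text{phys}}_{ij} / \phi_u\right), \\
\mathbf{v} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_v^2\, \mathbf{R}_v\right), \quad
[\mathbf{R}_v]_{ij} = \exp\!\left(-d^{\text{log}}_{ij} / \phi_v\right).
\end{align*}
```

Under this sketch, if $\mathbf{u}$ and $\mathbf{v}$ are independent, their sum has covariance $\sigma_u^2 \mathbf{R}_u + \sigma_v^2 \mathbf{R}_v$, so two GPUs can be strongly correlated through either distance function even when they are far apart under the other.
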
Problem

Research questions and friction points this paper is trying to address.

Modeling GPU failure times with spatial correlations
Incorporating both physical and logical distance functions
Addressing complex network fabric topology in data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual spatial random effects for physical and logical distances
Bayesian modeling with MCMC for parameter estimation
Validation through simulation and application to GPU data
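
To make the idea of dual spatial random effects concrete, the following is a minimal simulation sketch in Python with NumPy. It is not the authors' code (the paper uses Stan) and it rests on assumptions that do not come from the source: a lognormal failure-time model, exponential correlation in both distances, a toy 2×3 physical grid, and a logical distance defined as the gap between hypothetical positions along a chain. The function names `exp_corr` and `simulate_failure_times` and all default parameter values are made up for illustration.

```python
import numpy as np


def exp_corr(dist, range_param):
    """Exponential correlation matrix built from a pairwise distance matrix."""
    return np.exp(-dist / range_param)


def simulate_failure_times(phys_xy, logical_dist, beta0=3.0,
                           sd_phys=0.5, sd_logic=0.3,
                           range_phys=2.0, range_logic=1.5,
                           sd_noise=0.4, seed=0):
    """Draw failure times with two independent spatial random effects.

    Assumed (hypothetical) model: log T = beta0 + u + v + noise, where u is
    correlated through physical distance and v through logical distance.
    """
    rng = np.random.default_rng(seed)
    n = phys_xy.shape[0]

    # Pairwise Euclidean distances between physical locations.
    diff = phys_xy[:, None, :] - phys_xy[None, :, :]
    phys_dist = np.sqrt((diff ** 2).sum(axis=-1))

    # One covariance matrix per distance function.
    cov_phys = sd_phys ** 2 * exp_corr(phys_dist, range_phys)
    cov_logic = sd_logic ** 2 * exp_corr(logical_dist, range_logic)

    u = rng.multivariate_normal(np.zeros(n), cov_phys)   # physical effect
    v = rng.multivariate_normal(np.zeros(n), cov_logic)  # logical effect
    log_t = beta0 + u + v + sd_noise * rng.standard_normal(n)

    # Also return the covariance of the combined random effect u + v.
    return np.exp(log_t), cov_phys + cov_logic


# Toy layout: six GPUs on a 2x3 physical grid.
phys_xy = np.array([[0, 0], [0, 1], [0, 2],
                    [1, 0], [1, 1], [1, 2]], dtype=float)

# Hypothetical logical ordering along a chain that differs from the grid.
# Distances on a line are Euclidean, so the exponential correlation is a
# valid covariance; arbitrary graph distances do not guarantee this.
logical_rank = np.array([0, 3, 1, 4, 2, 5], dtype=float)
logical_dist = np.abs(logical_rank[:, None] - logical_rank[None, :])

times, cov = simulate_failure_times(phys_xy, logical_dist)
print(times.round(2))
```

In this toy layout, GPUs 0 and 2 are two grid steps apart physically but adjacent in the logical chain, so the logical term contributes correlation that a physical-distance-only model would miss.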