SlowPoke: Understanding and Detecting On-Chip Fail-Slow Failures in Many-Core Systems

📅 2025-10-28

🤖 AI Summary

Fail-slow failures in many-core systems are difficult to detect because their manifestations are subtle and time-varying and on-chip observability is low. Method: This paper proposes SlowPoke, a lightweight, hardware-aware detection framework that combines compiler-based instrumentation, on-the-fly trace compression that operates within kilobytes of memory, and a topology-aware ranking algorithm that localizes a failure's root cause. Contribution/Results: Across a range of representative many-core workloads, SlowPoke reduces trace storage overhead by 115.9× on average and achieves an average detection accuracy of 86.77% with a false positive rate of 12.11%. It also scales across different many-core architectures, which makes it practical for large-scale deployments.

📝 Abstract

Many-core architectures are essential for high-performance computing, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across the hardware topology. This paper introduces SlowPoke, a lightweight, hardware-aware framework for practical on-chip fail-slow detection. SlowPoke combines compiler-based instrumentation for low-overhead monitoring, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SlowPoke on a wide range of representative many-core workloads, and the results demonstrate that SlowPoke reduces the storage overhead of detection traces by an average of 115.9$\times$, while achieving an average fail-slow detection accuracy of 86.77% and a false positive rate (FPR) of 12.11%. More importantly, SlowPoke scales effectively across different many-core architectures, making it practical for large-scale deployments.
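To make the idea of on-the-fly trace compression under a fixed memory budget concrete, here is a minimal sketch. It is not SlowPoke's compression scheme, which the abstract does not describe. It assumes plain run-length encoding of consecutive identical events, and the names `RunLengthTrace`, `max_runs`, and the event labels are hypothetical.

```python
from collections import deque


class RunLengthTrace:
    """Bounded-memory trace: consecutive identical events collapse into one run.

    Illustrative only; the event names and the run budget are assumptions.
    """

    def __init__(self, max_runs):
        # A deque with maxlen silently drops the oldest run once the budget is hit,
        # so memory use stays bounded no matter how long the program runs.
        self.runs = deque(maxlen=max_runs)

    def record(self, event_id):
        if self.runs and self.runs[-1][0] == event_id:
            # Same event as the previous one: bump the counter in place.
            self.runs[-1][1] += 1
        else:
            # A new kind of event starts a new run.
            self.runs.append([event_id, 1])


trace = RunLengthTrace(max_runs=256)
events = ["send"] * 50 + ["recv"] * 50 + ["compute"] * 100
for event in events:
    trace.record(event)

# 200 raw events are stored as 3 runs.
print(list(trace.runs))
```

Repetitive loop-dominated workloads compress well under this kind of scheme, while irregular event streams do not, which is one reason a real design would likely need something more elaborate.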

Problem

Research questions and friction points this paper is trying to address.

Detecting on-chip fail-slow failures in many-core systems

Overcoming strict memory limits for failure monitoring

Pinpointing failure root causes across hardware topology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler-based instrumentation for low-overhead monitoring

On-the-fly trace compression within kilobytes of memory

Topology-aware ranking algorithm for root cause analysis
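The intuition behind topology-aware ranking can be sketched with a toy heuristic. This is not the paper's algorithm. It assumes a 2D mesh of cores and a per-core slowdown ratio (observed latency divided by expected latency), and it scores each core by how much slower it is than its mesh neighbors, so that cores that are merely waiting on a slow neighbor rank lower. The functions `mesh_neighbors` and `rank_suspects` and all values are hypothetical.

```python
def mesh_neighbors(x, y, width, height):
    """Yield the coordinates of the 4-connected neighbors in a 2D mesh."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield nx, ny


def rank_suspects(slowdown, width, height):
    """Return core coordinates ordered from most to least suspicious.

    slowdown maps (x, y) -> observed/expected latency ratio (assumed input).
    """
    scores = {}
    for (x, y), own in slowdown.items():
        nbrs = [slowdown[n] for n in mesh_neighbors(x, y, width, height)]
        # Score = own slowdown minus the neighborhood average; a 1x1 mesh
        # has no neighbors, so fall back to the raw slowdown.
        scores[(x, y)] = own - sum(nbrs) / len(nbrs) if nbrs else own
    return sorted(scores, key=scores.get, reverse=True)


# 3x3 mesh: the center core is slow and drags its direct neighbors down a little.
slowdown = {(x, y): 1.0 for x in range(3) for y in range(3)}
slowdown[(1, 1)] = 3.0
for n in mesh_neighbors(1, 1, 3, 3):
    slowdown[n] = 1.5

ranking = rank_suspects(slowdown, 3, 3)
print(ranking[0])  # the center core (1, 1) ranks first
```

Comparing each core against its neighborhood rather than against a global threshold is the design choice being illustrated: a victim core next to the real culprit looks slow in absolute terms but not relative to its surroundings.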
