🤖 AI Summary
Fail-slow failures in multicore systems are notoriously difficult to detect due to their subtle, time-varying manifestations and low observability.
Method: This paper proposes a lightweight, hardware-aware detection framework that integrates compiler-based instrumentation, kilobyte-scale real-time trace compression, chip topology modeling, and a machine learning–driven anomaly ranking algorithm to enable low-overhead root-cause localization.
Contribution/Results: It is the first work to achieve continuous online monitoring under resource constraints on many-core architectures. Under typical workloads, it reduces trace storage overhead by 115.9× on average, achieves 86.77% detection accuracy, and maintains a low false positive rate of 12.11%, while demonstrating cross-architecture scalability. The core innovation lies in the synergistic integration of compiler-assisted monitoring, topology-aware compression, and hardware-coordinated anomaly ranking—effectively overcoming the trade-off between overhead and accuracy that limits existing approaches.
📝 Abstract
Many-core architectures are essential for high-performance computing, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across the hardware topology. This paper introduces SlowPoke, a lightweight, hardware-aware framework for practical on-chip fail-slow detection. SlowPoke combines compiler-based instrumentation for low-overhead monitoring, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SlowPoke on a wide range of representative many-core workloads, and the results demonstrate that SlowPoke reduces the storage overhead of detection traces by an average of 115.9$\times$, while achieving an average fail-slow detection accuracy of 86.77% and a false positive rate (FPR) of 12.11%. More importantly, SlowPoke scales effectively across different many-core architectures, making it practical for large-scale deployments.