Strix: Re-thinking NPU Reliability from a System Perspective

📅 2026-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing system-level fault-tolerance approaches for NPUs treat the accelerator as a monolithic unit and rely on coarse-grained replication, whose performance and hardware overheads make them impractical for safety-critical applications. This work proposes Strix, a full-stack reliability framework built on an open-source SoC and spanning microarchitecture, ISA, and programming methods. Strix re-partitions the NPU along the inference pipeline, identifies the dominant failure modes, and attaches targeted safeguards to them. It achieves sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown (roughly 4% added runtime) and minimal hardware overhead.

📝 Abstract
DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit and use coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning microarchitecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only a 1.04$\times$ slowdown and minimal hardware overhead.
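
To put the reported 1.04$\times$ slowdown in context, the short Python sketch below converts a slowdown factor into percent added runtime. It is only an illustration of the arithmetic, not anything taken from the paper: the function names, the 10 ms baseline latency, and the 2$\times$ figure for a naive run-twice-and-compare scheme are all assumptions chosen for the example.

```python
def overhead_percent(slowdown: float) -> float:
    """Convert a slowdown factor (protected time / baseline time)
    into percent extra runtime."""
    if slowdown <= 0:
        raise ValueError("slowdown factor must be positive")
    return (slowdown - 1.0) * 100.0


def protected_latency_ms(baseline_ms: float, slowdown: float) -> float:
    """Latency of the protected run, given the unprotected baseline."""
    return baseline_ms * slowdown


# Reported in the abstract: 1.04x slowdown.
STRIX_SLOWDOWN = 1.04
# ASSUMPTION (not from the paper): a naive scheme that runs every
# inference twice and compares results costs about 2x.
NAIVE_DUPLICATION_SLOWDOWN = 2.0
# ASSUMPTION: hypothetical baseline inference latency.
BASELINE_MS = 10.0

for name, factor in [
    ("Strix (reported)", STRIX_SLOWDOWN),
    ("run-twice (assumed)", NAIVE_DUPLICATION_SLOWDOWN),
]:
    print(
        f"{name}: {overhead_percent(factor):.0f}% added runtime, "
        f"{protected_latency_ms(BASELINE_MS, factor):.1f} ms per inference"
    )
```

Under these assumptions, a 1.04$\times$ slowdown corresponds to about 4% added runtime, whereas the assumed duplication scheme doubles latency.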
Problem

Research questions and friction points this paper is trying to address.

NPU reliability
hardware faults
system-level reliability
fault tolerance
neural network accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

NPU reliability
full-stack framework
fine-grained fault tolerance
system-level resilience
hardware-software co-design