Strix: Re-thinking NPU Reliability from a System Perspective

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing system-level fault-tolerance approaches for NPUs rely on coarse-grained redundancy, which incurs excessive performance and hardware overhead and leaves them ill-suited to the stringent reliability demands of safety-critical applications. This work proposes Strix, a full-stack reliability framework spanning microarchitecture, instruction set, and programming model, built on an open-source SoC. Strix partitions the NPU at fine granularity along the inference pipeline, identifies the dominant failure modes, and attaches targeted protection mechanisms to each. Rather than treating the NPU as a monolithic unit, Strix achieves sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown and minimal hardware cost.

📝 Abstract

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit and use coarse-grained replication, which incurs prohibitive performance and hardware overheads and leaves a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only 1.04$\times$ slowdown and minimal hardware overhead.
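
The abstract does not say what Strix's safeguards are, so the sketch below is not the paper's mechanism. It only illustrates the contrast the abstract draws between coarse-grained replication and a targeted check, using a classic checksum test on one integer matrix multiply. All function names are hypothetical, and `matmul` merely stands in for an accelerator kernel.

```python
# Illustrative sketch only; NOT Strix's mechanism. All names are hypothetical.

def matmul(a, b):
    """Plain integer matrix multiply, standing in for an NPU kernel."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def replication_ok(a, b, c):
    """Coarse-grained check: recompute the whole product and compare.
    Costs a second full multiply, i.e. roughly 2x the work."""
    return matmul(a, b) == c

def faulty_columns(a, b, c):
    """Targeted check: the column sums of C must equal (column sums of A) * B.
    Costs one vector-matrix product instead of a full multiply.
    Returns the indices of output columns whose checksum does not match."""
    col_sums_a = [sum(col) for col in zip(*a)]
    expected = [sum(s * v for s, v in zip(col_sums_a, col)) for col in zip(*b)]
    actual = [sum(col) for col in zip(*c)]
    return [j for j, (e, r) in enumerate(zip(expected, actual)) if e != r]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

good = matmul(a, b)
bad = [row[:] for row in good]
bad[0][1] ^= 4  # inject a single bit flip into one output element

print(faulty_columns(a, b, good))  # no mismatching columns
print(faulty_columns(a, b, bad))   # column 1 is flagged
```

Both checks catch the injected bit flip, but the checksum also narrows the fault to one output column and does far less extra work than rerunning the kernel. That is the general sense in which a targeted safeguard can be cheaper than replicating a whole unit.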

Problem

Research questions and friction points this paper is trying to address.

NPU reliability

hardware faults

system-level reliability

fault tolerance

neural network accelerators

Innovation

Methods, ideas, or system contributions that make the work stand out.

NPU reliability

full-stack framework

fine-grained fault tolerance