🤖 AI Summary
Existing system-level fault-tolerance approaches for NPUs treat the accelerator as a monolithic unit and rely on coarse-grained replication, which incurs prohibitive performance and hardware overhead and leaves them unable to meet the reliability demands of safety-critical applications. This work proposes Strix, a full-stack NPU reliability framework built on an open-source SoC and spanning microarchitecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies the dominant failure modes, and attaches targeted safeguards to each. It achieves sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown and minimal hardware overhead.
📝 Abstract
DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only a 1.04$\times$ slowdown and minimal hardware overhead.