🤖 AI Summary
Hardware accelerators deployed in data centers and safety-critical systems are vulnerable to control-flow soft errors, which can lead to silent data corruption and system failure. To address this, we propose two complementary online control-flow error detection techniques: (1) specification-derived Petri net modeling and (2) behavior-derived state sequence comparison. Our approach is the first to jointly leverage specification-level formal modeling and runtime behavioral analysis for fine-grained anomaly detection. The two techniques can be applied selectively to fit a given area budget, trading fault coverage against overhead. Evaluated on four RTL designs (convolution, Gaussian blur, AES encryption, and a NoC router), the method detects 48%–100% of failures caused by bit-flips in control registers and primary control inputs, with only 0.5%–10% area overhead.
📝 Abstract
In hardware accelerators used in data centers and safety-critical applications, soft errors and the resulting silent data corruption significantly compromise reliability. Upsets in control-flow operations are particularly harmful because they can lead to severe failures. To address this, we introduce two methods for monitoring control flow: one based on specification-derived Petri nets and one based on behavior-derived state transitions. We validated both methods on four designs: a convolutional layer, Gaussian blur, AES encryption, and a Network-on-Chip router. Our fault injection campaign, which targeted the control registers and primary control inputs, demonstrated high error detection rates in both datapath and control logic. The proposed detectors quickly detect 48% to 100% of failures resulting from upsets in internal control registers and perturbations in primary control inputs. Synthesis results show that, in most cases, the maximum detection rate is reached with an area overhead ranging from a few percent to about 10%. We also compared the two methods in terms of area overhead and error detection rate. Applying them selectively accommodates a wide range of area constraints, which makes the detectors practical to implement while effectively enhancing error detection.
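To make the behavior-derived idea concrete, the following is a minimal software sketch, not the authors' hardware detector. It assumes that the set of legal state transitions is collected from fault-free traces and that any transition outside this set is flagged as a control-flow error. The state names (`IDLE`, `LOAD`, `COMPUTE`, `STORE`) and the function names are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Set, Tuple

Transition = Tuple[str, str]


def learn_transitions(traces: Iterable[Iterable[str]]) -> Set[Transition]:
    """Collect every (current, next) state pair seen in fault-free traces."""
    allowed: Set[Transition] = set()
    for trace in traces:
        states = list(trace)
        allowed.update(zip(states, states[1:]))
    return allowed


def first_violation(trace: Iterable[str], allowed: Set[Transition]) -> Optional[int]:
    """Return the index of the first state reached by an unseen transition,
    or None if every transition in the trace was observed during learning."""
    states = list(trace)
    for index, pair in enumerate(zip(states, states[1:]), start=1):
        if pair not in allowed:
            return index
    return None


# Hypothetical fault-free run of a small controller (assumed state names).
golden_traces = [["IDLE", "LOAD", "COMPUTE", "STORE", "IDLE"]]
allowed = learn_transitions(golden_traces)

# Hypothetical faulty run: an upset makes the controller skip COMPUTE.
faulty_trace = ["IDLE", "LOAD", "STORE", "IDLE"]
violation_index = first_violation(faulty_trace, allowed)
```

Here `violation_index` points at the state entered through the illegal `LOAD` to `STORE` transition. A hardware monitor would perform the equivalent check every clock cycle on the control registers; this sketch only models that comparison in software.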