FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional LockStep mechanisms in multi-/many-core real-time systems suffer from resource underutilization and degraded real-time performance due to static configuration, rigid synchronization, and fixed error-detection regions. To address this, we propose FlexStep—a hardware-software co-designed architecture enabling dynamic core configuration and asynchronous, preemptible error detection. Its key innovation lies in cross-layer coordination across SoC microarchitecture, ISA extensions, and OS scheduling: it introduces a lightweight hardware checker unit and a dedicated scheduler to enable runtime flexibility in selecting error-detection regions, timing, and core assignments. Experimental evaluation demonstrates that FlexStep maintains high reliability while significantly improving resource utilization and task schedulability—reducing average latency by 37%—and the full implementation is open-sourced.

📝 Abstract
Reliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms, such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, static error detection regions, or limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.

Problem

Research questions and friction points this paper is trying to address.

Enhances error detection in multi-core real-time systems
Reduces resource over-provisioning and improves responsiveness
Supports dynamic core configuration and flexible task scheduling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic core configuration for error detection
Asynchronous preemptive error detection mechanism
Integration across SoC, ISA, and OS layers
Tinglue Wang
Southeast University, China
Yiming Li
Southeast University, China
Wei Tang
Southeast University, China
Jiapeng Guan
Dalian University of Technology, China
Zhen-Jun Guo
Southeast University, China
Renshuang Jiang
National University of Defense Technology
Ran Wei
Lancaster University, UK
Jing Li
New Jersey Institute of Technology, US
Zhe Jiang
Southeast University, China