Oobleck: Low-Compromise Design for Fault Tolerant Accelerators

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Data center hardware refresh cycles are lengthening while growing processor complexity raises the potential for faults, so on-chip accelerator datapaths need fault tolerance that does not cost much area. Method: The paper proposes Oobleck, an accelerator fault-tolerance architecture that relies on modular acceleration rather than the area-heavy approaches of earlier work. It also introduces Viscosity, an actor-based hardware-software co-design language that generates both hardware and software descriptions from a single description of the accelerator's function, which streamlines development and enforces the modular conventions. Contribution/Results: In three case studies (FFT, AES, and DCT accelerators), the accelerated applications retain speedups of 1.7× to 5.16× over purely software implementations under a single fault. High-level data center models indicate fewer failure-induced chip purchases with no effect on aggregate throughput, and adding hot-spare FPGAs to the chip yields further benefits.

📝 Abstract
Data center hardware refresh cycles are lengthening. However, increasing processor complexity is raising the potential for faults. To achieve longevity in the face of increasingly fault-prone datapaths, fault tolerance is needed, especially in on-chip accelerator datapaths. Previously researched methods for adding fault tolerance to accelerator designs require high area, lowering chip utilisation. We propose a novel architecture for accelerator fault tolerance, Oobleck, which leverages modular acceleration to enable fault tolerance without burdensome area requirements. To streamline development and enforce modular conventions, we introduce the Viscosity language, an actor-based approach to hardware-software co-design. Viscosity uses a single description of the accelerator's function and produces both hardware and software descriptions. Our high-level models of data centers indicate that our approach can decrease the number of failure-induced chip purchases inside data centers while not affecting aggregate throughput, thus reducing data center costs. To show the feasibility of our approach, we present three case studies: FFT, AES, and DCT accelerators. We additionally profile the performance under the key parameters affecting latency. Under a single fault we can maintain speedups of between 1.7x and 5.16x for accelerated applications over purely software implementations. We show that further benefits can be achieved by adding hot-spare FPGAs into the chip.
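The abstract does not spell out how a modular accelerator keeps running after a fault. The sketch below is one plausible reading, not the authors' implementation: it assumes the accelerator is a chain of stages, that each stage has a hardware and a software version derived from one functional description (as Viscosity is said to produce), and that a stage whose hardware is faulty runs in software. The `Stage` type, the stage names, and all latency values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    fn: Callable[[int], int]  # one functional description shared by both paths
    hw_cost: float            # hypothetical latency on the accelerator module
    sw_cost: float            # hypothetical latency of the software version
    faulty: bool = False      # True once the hardware module is marked faulty


def run_pipeline(stages: List[Stage], x: int) -> Tuple[int, float]:
    """Run every stage; only stages with faulty hardware pay the software cost."""
    total = 0.0
    for stage in stages:
        x = stage.fn(x)  # same function either way, so the result is unchanged
        total += stage.sw_cost if stage.faulty else stage.hw_cost
    return x, total


def speedup_over_software(stages: List[Stage]) -> float:
    """All-software latency divided by the latency of the current configuration."""
    all_software = sum(s.sw_cost for s in stages)
    current = sum(s.sw_cost if s.faulty else s.hw_cost for s in stages)
    return all_software / current


def make_stages() -> List[Stage]:
    # Made-up stages and costs; they are not taken from the paper.
    return [
        Stage("load", lambda v: v + 1, hw_cost=1.0, sw_cost=4.0),
        Stage("transform", lambda v: v * 2, hw_cost=1.0, sw_cost=8.0),
        Stage("store", lambda v: v - 3, hw_cost=1.0, sw_cost=4.0),
    ]
```

With these made-up costs, marking the `transform` stage as faulty leaves the output unchanged and lowers the modeled speedup over software from 16/3 to 1.6, which illustrates how a single fault can reduce, but not eliminate, the benefit of acceleration.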

Problem

Research questions and friction points this paper is trying to address.

Achieving fault tolerance in accelerators without high area costs
Streamlining development with a unified hardware-software co-design language
Reducing data center costs by minimizing failure-induced hardware replacements

Innovation

Methods, ideas, or system contributions that make the work stand out.

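Modular acceleration enables fault tolerance without burdensome area requirements
Viscosity language produces hardware and software descriptions from a single description of the accelerator's function
Hot-spare FPGAs added to the chip provide further benefits under faults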