CHIMERA: A Flexible and Scalable 3.1 TOPS/W AI-MCU with Transformer Accelerator and 563 Gb/s Shared-L2 Memory Subsystem with QoS Guarantees

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the compute, energy-efficiency, and memory-bandwidth challenges of deploying rapidly evolving Transformer models on ultra-low-power edge devices by proposing a scalable AI microcontroller fabricated in 22 nm FDX technology. The architecture integrates a custom Transformer accelerator into a compute cluster with nine general-purpose RISC-V cores, and introduces a high-bandwidth L2 memory island that supports data sharing across multiple clusters and enforces quality-of-service (QoS) guarantees for latency-critical traffic. The chip achieves a peak energy efficiency of 3.1 TOPS/W and a peak area efficiency of 281 GOPS/mm², which is 1.37× the energy efficiency and up to 100× the area efficiency of state-of-the-art SoCs. The QoS mechanism in the L2 subsystem reduces latency for latency-critical traffic by up to 16×.

📝 Abstract

We present Chimera, a flexible and scalable Microcontroller Unit (MCU) designed to accelerate real-time inference of rapidly evolving transformer-based models at the ultra-low-power edge (hundreds of mW). The chip, implemented in 22 nm FDX technology, integrates a transformer accelerator tightly coupled within a compute cluster featuring nine general-purpose RV32IMA cores. Scalability extends to the memory hierarchy through a novel L2 memory island subsystem, which enables data sharing across multiple clusters while delivering 563 Gb/s aggregate bandwidth. The L2 subsystem enforces quality-of-service guarantees for latency-critical traffic, achieving up to 16× latency reduction. Chimera achieves peak energy and area efficiencies of 3.1 TOPS/W and 281 GOPS/mm², demonstrating 1.37× higher energy efficiency and up to 100× higher area efficiency compared to State of the Art (SoA) SoCs. Compared to SoA standalone accelerators, Chimera achieves comparable energy efficiency and up to 1.8× higher area efficiency.
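The headline figures in the abstract can be put into more familiar units with a little arithmetic. The sketch below is illustrative only: the constants are copied from the abstract, while the function names and the derived values (energy per operation, bandwidth in GB/s, and the baselines implied by the stated ratios) are assumptions of this example, not numbers reported by the authors.

```python
# Figures quoted in the abstract.
PEAK_TOPS_PER_W = 3.1          # peak energy efficiency
PEAK_GOPS_PER_MM2 = 281.0      # peak area efficiency
L2_BANDWIDTH_GBIT_S = 563.0    # aggregate L2 bandwidth
ENERGY_GAIN_VS_SOC = 1.37      # energy-efficiency ratio vs. SoA SoCs
AREA_GAIN_VS_SOC = 100.0       # "up to" area-efficiency ratio vs. SoA SoCs


def energy_per_op_pj(tops_per_w: float) -> float:
    """Energy per operation in picojoules.

    1 TOPS/W is 1e12 operations per joule, so one operation costs
    1 / tops_per_w picojoules.
    """
    return 1.0 / tops_per_w


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8.0


def implied_baseline(value: float, gain: float) -> float:
    """Baseline figure implied by a stated improvement ratio (hypothetical helper)."""
    return value / gain


if __name__ == "__main__":
    print(f"Energy per op:        {energy_per_op_pj(PEAK_TOPS_PER_W):.3f} pJ")
    print(f"L2 bandwidth:         {gbit_to_gbyte(L2_BANDWIDTH_GBIT_S):.3f} GB/s")
    print(f"Implied SoC baseline: "
          f"{implied_baseline(PEAK_TOPS_PER_W, ENERGY_GAIN_VS_SOC):.2f} TOPS/W")
    print(f"Implied SoC baseline: "
          f"{implied_baseline(PEAK_GOPS_PER_MM2, AREA_GAIN_VS_SOC):.2f} GOPS/mm^2")
```

Because the area ratio is stated as "up to 100×", the implied area baseline corresponds only to the least area-efficient SoC in the comparison, not to every design the paper compares against.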

Problem

Research questions and friction points this paper is trying to address.

Transformer

edge AI

ultra-low-power

real-time inference

memory subsystem

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer Accelerator

Shared-L2 Memory Subsystem

Quality-of-Service (QoS)