MOSAIC: A Workload-Driven Simulation and Design-Space Exploration Framework for Heterogeneous NPUs

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of homogeneous, multiply-accumulate (MAC)-centric neural processing units (NPUs) on emerging AI models that rely heavily on non-MAC operators such as FFT, spiking-neuron integration, and polynomial evaluation. The authors propose MOSAIC, an analytical simulator and design-space-exploration framework that jointly optimizes several dimensions of tile-level heterogeneity within a single NPU: tile-type composition (Big, Little, and Special-Function tiles), array size, precision, dataflow, sparsity mode, and MAC engine type. A heterogeneity-aware compiler maps operators across the tile mix, cost models are calibrated to a 7 nm node and cross-validated against NVDLA, and the search pairs a stratified sweep with genetic-algorithm refinement. Across 20 workloads, the best general-purpose heterogeneous NPU found by MOSAIC (~200 mm²) achieves 46.91% mean energy savings over the best homogeneous baseline of equal area.
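To make the headline metric concrete, the following is a minimal sketch of how a mean energy-savings percentage over a workload suite could be computed. It assumes an arithmetic mean of per-workload savings relative to the baseline; the summary does not state the exact averaging used, and the function names and any numbers passed in are hypothetical illustrations, not results from the paper.

```python
from statistics import mean


def energy_savings_pct(baseline_energy: float, candidate_energy: float) -> float:
    """Percent of energy saved relative to the baseline (positive means less energy used)."""
    if baseline_energy <= 0:
        raise ValueError("baseline energy must be positive")
    return 100.0 * (baseline_energy - candidate_energy) / baseline_energy


def mean_savings_pct(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Arithmetic mean of per-workload savings (an assumed averaging scheme).

    Both dicts map a workload name to its energy on designs of equal area.
    Only workloads present in both dicts are compared.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no workloads in common")
    return mean(energy_savings_pct(baseline[w], candidate[w]) for w in shared)
```

Averaging per-workload percentages weights every workload equally regardless of its absolute energy; a suite-total ratio would instead be dominated by the most energy-hungry workloads.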

📝 Abstract

AI model architectures are diversifying rapidly. Although dense matrix multiplication underlies today's CNNs and transformers, emerging architectures (state-space models, long convolutions via the fast Fourier transform (FFT), Kolmogorov-Arnold networks, and spiking networks) are not multiply-accumulate (MAC) dominated; they spend much of their computation on vector and non-MAC primitives that homogeneous, MAC-centric neural processing units (NPUs) serve poorly. This has motivated heterogeneous NPUs (HPUs) built from non-identical tiles. Prior heterogeneous designs vary only one or two coarse knobs (typically MAC precision or array size) and are evaluated on narrow workloads; no existing framework supports fine-grained HPU design, where tiles differ across many architectural dimensions at once. We present MOSAIC, an analytical simulator and design-space-exploration (DSE) framework for HPU microarchitecture design. MOSAIC searches the joint space of tile-level heterogeneity: beyond array size and precision, it varies tile-type composition (large Big, small Little, and non-MAC Special-Function tiles), dataflow, sparsity mode, MAC engine type, and special-function units for non-MAC operators (FFT, spiking-integrate, polynomial). Unlike prior simulators that model a single homogeneous tile type, MOSAIC models non-MAC tiles with their own energy, area, and timing models and maps operators across a mix of tiles with a heterogeneity-aware compiler. A multi-seed pipeline pairing a stratified sweep with genetic-algorithm refinement returns Pareto-optimal designs, with cost models calibrated to a 7 nm node and cross-validated against NVIDIA's Deep Learning Accelerator (NVDLA). Across a 20-workload suite, the best general-purpose HPU found by MOSAIC (~200 mm^2 Big+Little+Special-Function) achieves +46.91% mean iso-area energy savings over the best iso-area homogeneous baseline.
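The Pareto-optimal selection step mentioned above can be illustrated with a short sketch. It is not MOSAIC's implementation: the `Design` record, its three objectives (area, energy, latency, all minimized), and the sample values are assumptions chosen for illustration, and the sweep and genetic-algorithm stages that would generate the candidates are omitted.

```python
from typing import NamedTuple


class Design(NamedTuple):
    """Hypothetical candidate design point; every objective is minimized."""
    name: str
    area_mm2: float
    energy_mj: float
    latency_ms: float


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    a_obj, b_obj = a[1:], b[1:]  # skip the name field
    no_worse = all(x <= y for x, y in zip(a_obj, b_obj))
    strictly_better = any(x < y for x, y in zip(a_obj, b_obj))
    return no_worse and strictly_better


def pareto_front(designs: list[Design]) -> list[Design]:
    """Keep the designs that no other design dominates (O(n^2) pairwise check)."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Illustrative values only; they are not taken from the paper.
candidates = [
    Design("A", 200.0, 10.0, 5.0),
    Design("B", 200.0, 12.0, 4.0),
    Design("C", 210.0, 12.0, 6.0),  # dominated by A on all three objectives
    Design("D", 150.0, 15.0, 8.0),  # survives because it has the smallest area
]
front = pareto_front(candidates)
```

Here `front` keeps A, B, and D. The quadratic pairwise check is adequate for a few thousand candidates; larger populations would typically use a sorted sweep or non-dominated sorting instead.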

Problem

Research questions and friction points this paper is trying to address.

heterogeneous NPUs

non-MAC primitives

design-space exploration

emerging AI models

tile-level heterogeneity

Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous NPU

design-space exploration

non-MAC primitives