🤖 AI Summary
Deep neural network (DNN) acceleration on heterogeneous dataflow accelerators faces two key bottlenecks: high off-chip memory traffic caused by coarse-grained per-layer mapping, and poor latency responsiveness for edge deployment caused by batch-level pipelining.
Method: This work proposes a fine-grained, depth-first scheduling methodology that enables efficient layer-fused DNN mapping onto heterogeneous multi-core accelerators.
Contribution/Results: We introduce Stream—a first-of-its-kind open-source modeling framework that jointly optimizes multi-granularity scheduling and heterogeneous architecture—and reveal the decisive impact of high-level architectural choices on energy efficiency. Combining scheduling modeling, multi-objective optimization (energy/delay/memory), a layer-fused DNN representation, and hardware-abstracted simulation, Stream achieves <5% error against measurements on three state-of-the-art hardware platforms. Compared to conventional layer-granularity scheduling, it reduces the energy-delay product by 2.4× on single-core architectures and by up to 30× on heterogeneous multi-core architectures.
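The core idea behind depth-first, layer-fused scheduling can be illustrated with a toy sketch (not Stream's API; all names are illustrative): layer-by-layer execution materializes every full intermediate feature map, while depth-first execution streams small tiles through the whole layer stack, shrinking the peak on-chip buffer footprint.

```python
# Toy contrast of layer-by-layer vs. depth-first (layer-fused) scheduling.
# Layers are modeled as elementwise functions; buffer footprint is the
# number of live intermediate values. This is a conceptual sketch only.

def layer_by_layer(x, layers):
    """Run each layer over the whole input before the next one starts."""
    peak = 0
    for f in layers:
        x = [f(v) for v in x]          # full intermediate feature map
        peak = max(peak, len(x))
    return x, peak

def depth_first(x, layers, tile=4):
    """Push one small tile through all layers before fetching the next."""
    out, peak = [], 0
    for i in range(0, len(x), tile):
        t = x[i:i + tile]              # only one tile is live at a time
        for f in layers:
            t = [f(v) for v in t]
        peak = max(peak, len(t))
        out.extend(t)
    return out, peak

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
x = list(range(64))
y1, p1 = layer_by_layer(x, layers)
y2, p2 = depth_first(x, layers)
assert y1 == y2                        # same results
print(p1, p2)                          # peak footprint: 64 vs 4
```

Both schedules compute identical outputs, but the depth-first variant's peak intermediate footprint is bounded by the tile size rather than the layer size—the property that lets layer-fused mappings cut off-chip traffic on memory-constrained edge accelerators.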
📝 Abstract
To keep up with the ever-growing performance demand of neural networks, specialized hardware (HW) accelerators are shifting towards multi-core and chiplet architectures. So far, these multi-accelerator systems exploit the increased parallelism by pipelining different NN layers across input batches on different cores to increase throughput. Yet, for latency-critical applications that cannot batch inputs, such layer-by-layer scheduling fails to fully exploit the available HW resources towards energy-efficient execution at the edge. This work, therefore, enables fine-grained depth-first scheduling of layer-fused DNNs onto multi-core architectures through an open-source modeling framework called Stream. Stream is capable of representing a wide range of scheduling granularities and HW architectures and optimizes execution schedules towards minimal energy, minimal latency, and/or minimal memory footprint for constrained edge devices. We validate against three SotA HW implementations employing layer-fused scheduling, showing a tight match with measured efficiencies. Using Stream in further explorations, we demonstrate that high-level architectural decisions greatly impact hardware efficiency under the fine-grained scheduling paradigm, reducing the energy-delay product by 2.4x for single-core architectures and by up to 30x for heterogeneous multi-core architectures compared to traditional scheduling at layer granularity.