NVR: Vector Runahead on NPUs for Sparse Memory Access

📅 2025-02-19
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Sparse deep neural networks (DNNs) reduce parameter counts but suffer from irregular memory access patterns, leading to high NPU cache miss rates and limited practical acceleration. To address this, the authors propose NVR, a lightweight vector runahead execution mechanism and the first adaptation of runahead execution to NPU microarchitectures. NVR is a hardware-only solution: it runs as a decoupled, speculative sub-thread alongside the NPU and requires no compiler or algorithmic support. It is further paired with a small (16 KB) on-chip cache co-designed around sparse memory access patterns. With under 5% hardware overhead, NVR achieves an average 90% reduction in cache misses and a 4x average speedup across sparse DNN workloads; expanding the small NPU cache yields five times the performance benefit of adding the same capacity to the L2 cache.
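
To see where those misses come from, consider a minimal CSR sparse matrix-vector product in C (an illustrative sketch, not code from the paper). The load of x[col_idx[j]] depends on the value of another load, so its address cannot be predicted by a conventional stride prefetcher:

```c
#include <stddef.h>

/* Illustrative CSR sparse matrix-vector product (not code from the paper).
 * The load x[col_idx[j]] is data-dependent: its address is known only after
 * col_idx[j] itself has been fetched, so a stride prefetcher cannot predict
 * it. This is the irregular access pattern behind the NPU cache misses
 * that NVR targets. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* n_rows + 1 entries */
              const size_t *col_idx,   /* one entry per nonzero */
              const float  *vals,      /* one entry per nonzero */
              const float  *x,         /* dense input vector */
              float        *y)         /* dense output vector */
{
    for (size_t i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            acc += vals[j] * x[col_idx[j]];   /* indirect, irregular load */
        }
        y[i] = acc;
    }
}
```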

📝 Abstract
Deep neural networks increasingly leverage sparsity to curb the growth of model parameter counts. However, reducing wall-clock time through sparsity and pruning remains challenging due to irregular memory access patterns that lead to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored to NPUs that addresses cache misses in sparse DNN workloads. Rather than optimising memory patterns, which carries high overhead and poor portability, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors, delivering a 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the benefit of incorporating a small cache (16 KB) into the NPU alongside NVR: expanding this modest cache delivers 5x higher performance benefit than increasing the L2 cache by the same amount.
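
As a rough software analogue of the mechanism (a sketch under our own assumptions; NVR itself is a decoupled hardware sub-thread and needs no compiler support), runahead can be emulated by resolving the indirect index a fixed distance ahead of the compute loop and issuing prefetches with GCC/Clang's __builtin_prefetch:

```c
#include <stddef.h>

/* Hypothetical lookahead distance; in NVR this is handled in hardware. */
#define RUNAHEAD_DIST 32

/* Software analogue of runahead prefetching over a CSR sparse
 * matrix-vector product. The loop resolves the indirect index
 * RUNAHEAD_DIST nonzeros ahead of the computation and prefetches the
 * target line, so the real load later hits in cache. A true runahead
 * thread would also prefetch the col_idx stream itself. */
void spmv_csr_runahead(size_t n_rows,
                       const size_t *row_ptr,
                       const size_t *col_idx,
                       const float  *vals,
                       const float  *x,
                       float        *y)
{
    size_t nnz = row_ptr[n_rows];         /* total number of nonzeros */
    for (size_t i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            size_t ahead = j + RUNAHEAD_DIST;
            if (ahead < nnz)              /* speculative: stop at the end */
                __builtin_prefetch(&x[col_idx[ahead]], /*rw=*/0, /*locality=*/1);
            acc += vals[j] * x[col_idx[j]];
        }
        y[i] = acc;
    }
}
```

In hardware, NVR performs this lookahead speculatively and off the NPU's critical path, which is how it stays within the reported sub-5% hardware overhead.
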
Problem

Research questions and friction points this paper is trying to address.

Address cache misses in sparse DNN workloads
Optimize memory access on NPUs
Enhance performance with minimal hardware overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

NPU Vector Runahead prefetching
Lightweight hardware sub-thread
Small on-chip cache (16 KB) integration
👥 Authors
Hui Wang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Zhengpeng Zhao
Huazhong University of Science and Technology
Jing Wang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Yushu Du
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Yuan Cheng
Nanjing University
Bing Guo
Harbin Institute of Technology, Shenzhen
Bioimaging, nanomedicines, photo-electrical materials, batteries, polymers
He Xiao
Harbin Institute of Technology
Chenhao Ma
The Chinese University of Hong Kong, Shenzhen
Data management, data mining
Xiaomeng Han
Southeast University
LLM accelerators
Dean You
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Jiapeng Guan
Dalian University of Technology
Ran Wei
Dalian University of Technology
Dawei Yang
Houmo AI
Zhe Jiang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University