A Flexible Precision Scaling Deep Neural Network Accelerator with Efficient Weight Combination

πŸ“… 2025-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address hardware resource underutilization and configuration complexity in deploying mixed-precision DNNs on edge devices, this work proposes an efficient hardware accelerator supporting continuous 2–8-bit variable precision. It introduces a hybrid dataflow architecture featuring serial activation inputs and parallel preloaded weights, along with a novel dual-mode weight loading scheme and a bit-width-adaptive decomposition mechanism. A bit-serial MAC unit based on a systolic array is designed and integrated with a carry-save adder (CSA) tree that sums both signed and unsigned operands across rows, significantly enhancing energy efficiency and flexibility. Implemented in TSMC 28nm CMOS, the accelerator achieves a peak throughput of 4.09 TOPS and an energy efficiency of 68.94 TOPS/W at 2-bitΓ—2-bit precision, 1.8Γ— to 3.2Γ— higher than state-of-the-art accelerators. This work establishes a highly adaptive, low-overhead hardware paradigm for dynamic-precision inference on resource-constrained edge platforms.
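
To make the dataflow concrete, here is a minimal Python sketch (not the authors' RTL; all names and bit-widths are illustrative assumptions) of the two ideas the summary describes: a bit-serial MAC that streams activation bits LSB-first against preloaded parallel weights, and a wide weight decomposed into 2-bit slices whose per-slice partial sums are recombined by shift-and-add, echoing the bit-width-adaptive decomposition. Unsigned operands only, for brevity.

```python
def bit_serial_mac(activations, weights, act_bits):
    """Accumulate sum(a*w) by streaming one activation bit-plane per 'cycle'."""
    acc = 0
    for cycle in range(act_bits):
        plane = 0
        for a, w in zip(activations, weights):
            if (a >> cycle) & 1:          # current activation bit selects...
                plane += w                # ...the preloaded weight
        acc += plane << cycle             # weight the bit-plane by its position
    return acc

def decompose_weight(w, total_bits, slice_bits=2):
    """Split an unsigned weight into LSB-first slices of slice_bits each."""
    mask = (1 << slice_bits) - 1
    return [(w >> s) & mask for s in range(0, total_bits, slice_bits)]

def combine_slices(partial_sums, slice_bits=2):
    """Merge per-slice partial sums spatially by shift-and-add."""
    return sum(p << (i * slice_bits) for i, p in enumerate(partial_sums))

# Usage: four 2-bit weight slices reproduce the monolithic 8-bit dot product.
acts = [5, 3, 7]
weights = [0b10110101, 0b01100011, 0b00011110]
partials = []
for i in range(4):                        # one partial sum per slice position
    slice_w = [decompose_weight(w, 8)[i] for w in weights]
    partials.append(bit_serial_mac(acts, slice_w, act_bits=3))
assert combine_slices(partials) == sum(a * w for a, w in zip(acts, weights))
```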

πŸ“ Abstract
Deploying mixed-precision neural networks on edge devices eases hardware resource and power constraints. To support fully mixed-precision neural network inference, flexible hardware accelerators are needed for continuously varying precision operations. However, previous works suffer from low hardware utilization and the overhead of reconfigurable logic. In this paper, we propose an efficient accelerator for 2~8-bit precision scaling with serial activation input and parallel weight preloading. First, we define two loading modes for the weight operands and decompose each weight into the corresponding bit-widths, which extends the weight precision support efficiently. Then, to improve hardware utilization for low-precision operations, we design an architecture that performs bit-serial MAC operations with systolic dataflow and combines the partial sums spatially. Furthermore, we design an efficient carry-save adder tree supporting both signed and unsigned summation across rows. Experimental results show that the proposed accelerator, synthesized in TSMC 28nm CMOS technology, achieves a peak throughput of 4.09 TOPS and a peak energy efficiency of 68.94 TOPS/W at 2/2-bit operations.
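
The signed/unsigned summation claim can be illustrated with a small model. Below is a hedged Python sketch (assumed behavior, not the paper's circuit): operands are zero- or sign-extended to the accumulator width depending on a mode flag, then reduced with 3:2 carry-save compressors so that only one final carry-propagate add is needed.

```python
def extend(x, width, out_width, signed):
    """Zero- or sign-extend a raw width-bit value to out_width bits."""
    if signed and (x >> (width - 1)) & 1:
        x |= ((1 << (out_width - width)) - 1) << width   # sign extension
    return x & ((1 << out_width) - 1)

def csa_3to2(a, b, c, mask):
    """3:2 compressor: three addends in, sum and carry vectors out."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s & mask, carry & mask

def csa_tree_sum(operands, width, out_width, signed):
    """Sum rows in unsigned or two's-complement mode via carry-save reduction."""
    mask = (1 << out_width) - 1
    vals = [extend(x, width, out_width, signed) for x in operands]
    while len(vals) > 2:                  # compress three rows into two
        a, b, c = vals.pop(), vals.pop(), vals.pop()
        vals.extend(csa_3to2(a, b, c, mask))
    total = sum(vals) & mask              # single final carry-propagate add
    if signed and (total >> (out_width - 1)) & 1:
        total -= 1 << out_width           # reinterpret as two's complement
    return total

# Usage: the same tree sums identical raw 4-bit rows in either mode.
rows = [0b1110, 0b0011, 0b1001, 0b0101]
print(csa_tree_sum(rows, 4, 12, signed=False))  # 14 + 3 + 9 + 5   = 31
print(csa_tree_sum(rows, 4, 12, signed=True))   # -2 + 3 + (-7) + 5 = -1
```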
Problem

Research questions and friction points this paper is trying to address.

Adaptive Hardware Accelerator
Deep Neural Network Optimization
Resource Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible Precision DNN Accelerator
Serial Activation and Parallel Weight Preloading
Improved Adder Tree Efficiency
Liang Zhao
South China University of Technology, Guangzhou, China
Kunming Shao
The Hong Kong University of Science and Technology, Hong Kong SAR, China; AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong SAR, China
Fengshi Tian
Now: HKUST | Former: Fudan University & Westlake University
VLSI and Intelligence
Tim Kwang-Ting Cheng
The Hong Kong University of Science and Technology, Hong Kong SAR, China; AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong SAR, China
Chi-Ying Tsui
The Hong Kong University of Science and Technology, Hong Kong SAR, China; AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong SAR, China
Yi Zou
Intel Labs
Near-data and in-memory computing; Computer Architecture and Computer Systems; Non-volatile storage; distributed storage; big data