🤖 AI Summary
This work addresses the underutilization of wide DSP datapaths when low-bit quantized neural networks are deployed on FPGAs. Existing packing strategies either support only specific fixed bitwidths or require substantial support logic outside the DSP. The authors propose a dynamic packing technique that uses the DSP's internal pre-adder to pack multiple signed or unsigned inputs of arbitrary bitwidth into a single wide multiplier path. Building on this scheme, they present two architectures, one optimized for matrix-vector multiplication and the other for convolution, both integrated into AMD's FINN framework. Evaluated on the UltraNet model, the approach reduces LUT utilization by 21% and increases frames per second per DSP (FPS/DSP) by 36% compared to the FINN reference.
📝 Abstract
Deep neural networks increasingly employ low-precision quantization to reduce computational requirements. While FPGAs are well suited for workloads with heterogeneous precisions, their dedicated digital signal processing (DSP) slices feature only fixed-width datapaths, which are significantly underutilized by low-bitwidth arithmetic. Previous approaches have already introduced the packing of multiple values onto the same wide DSP datapath, but they either support only specific fixed bitwidths or are wasteful in their use of additional support logic external to the DSP. This paper proposes an efficient method to dynamically pack multiple signed or unsigned inputs with arbitrary bitwidths into a wide multiplier path by leveraging the DSP's internal pre-adder. Building on this, we present two distinct architectures, one optimized for matrix-vector multiplications and the other for convolutions. Our implementations are integrated into AMD's FINN framework. With these optimizations, we reduce the LUT utilization by 21% and increase the FPS/DSP by 36% for the UltraNet model compared to the FINN reference.
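To make the general idea of operand packing concrete, the following is a minimal software sketch of how two narrow signed multiplications that share one operand can be computed with a single wide multiplication. It models only the generic arithmetic, not the architectures proposed in the paper. The function name `pack_mul` and the `shift` parameter are hypothetical, and the addition that forms the packed operand merely stands in for the role a DSP pre-adder could play.

```python
def pack_mul(a0: int, a1: int, b: int, shift: int) -> tuple[int, int]:
    """Return (a0 * b, a1 * b) using one wide multiplication.

    Assumption: a0 * b fits into `shift` bits as a signed value, i.e.
    -2**(shift - 1) <= a0 * b < 2**(shift - 1). Otherwise the two
    partial products overlap and the result is wrong.
    """
    # Place a1 in the upper field and add a0. With a signed a0 this
    # addition also performs the sign correction of the upper field.
    packed = (a1 << shift) + a0

    # One wide multiply: packed * b = (a1 * b) * 2**shift + (a0 * b)
    product = packed * b

    # Extract the lower field and reinterpret it as a signed value.
    low = product & ((1 << shift) - 1)
    if low >= 1 << (shift - 1):
        low -= 1 << shift

    # Remove the lower field (including its borrow) to recover the upper one.
    high = (product - low) >> shift
    return low, high


if __name__ == "__main__":
    # Two 4-bit signed weights multiplied by one 4-bit signed activation.
    print(pack_mul(-3, 5, -7, shift=8))
```

The choice of `shift` trades guard bits against packing density: a wider field tolerates larger products (or accumulation of several products) but leaves room for fewer values in the fixed-width multiplier.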