Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication

📅 2023-11-16
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Matrix multiplication (MatMul) is a dominant computational and energy bottleneck in deep neural network (DNN) inference. Method: This work introduces the first hardware accelerator for Maddness, a hash-based approximate matrix multiplication scheme that replaces conventional multiply-accumulate (MAC) operations with decision-tree passes and lookups into product-quantization (PQ) look-up tables (LUTs), making inference fully multiplication-free. The hash decision tree and PQ lookup are co-designed into a modular, fine-grained architecture built from small compute units and memories, and the design is evaluated with technology scaling from a commercial 14 nm node down to 3 nm. Results: On ResNet-9 for CIFAR-10, the accelerator maintains above 92.5% Top-1 accuracy while reaching 161 TOp/s/W at 3 nm (0.55 V), a 25× energy-efficiency and 15× area-efficiency improvement over direct MatMul accelerators in the same technology.
📝 Abstract
From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9.
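The PQ-plus-LUT scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: Maddness replaces the nearest-prototype search used here with a learned, comparison-only decision tree, and learns prototypes offline; the function name `pq_amm` and the toy prototype selection are assumptions for the example.

```python
import numpy as np

def pq_amm(A, B, n_codebooks=4, n_prototypes=16, rng=None):
    """Approximate A @ B with product quantization plus lookup tables.

    Columns of A are split into `n_codebooks` disjoint subspaces. Each row
    of A is encoded as the index of its nearest prototype per subspace
    (a stand-in for Maddness's decision-tree hash). Inner products between
    prototypes and the matching rows of B are precomputed into LUTs, so
    the online "matmul" is only table lookups and additions.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    assert d % n_codebooks == 0
    sub = d // n_codebooks

    codes = np.empty((n, n_codebooks), dtype=np.int64)
    luts = []
    for c in range(n_codebooks):
        Ac = A[:, c * sub:(c + 1) * sub]
        # Toy prototype selection: sample rows of the subspace.
        # (Real PQ typically learns prototypes offline, e.g. with k-means.)
        protos = Ac[rng.choice(n, size=min(n_prototypes, n), replace=False)]
        dists = ((Ac[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        codes[:, c] = dists.argmin(1)
        # LUT: prototype dot products with the matching rows of B.
        luts.append(protos @ B[c * sub:(c + 1) * sub, :])

    # Online phase: sum the LUT rows selected by the codes.
    out = np.zeros((n, B.shape[1]))
    for c in range(n_codebooks):
        out += luts[c][codes[:, c]]
    return out
```

When every row of A is exactly representable by the prototypes, the result matches `A @ B`; in general the quality of the approximation depends on how well the learned prototypes cover the activation distribution.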
Problem

Research questions and friction points this paper is trying to address.

Addresses the high computational and energy cost of matrix multiplication in DNN inference
Proposes a multiplier-free approximate matrix multiplication scheme based on Maddness
Extends Maddness with a differentiable approximation so accuracy can be recovered through fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable Maddness enabling gradient-based fine-tuning
Hash-based product quantization that replaces multiply-accumulates with decision-tree passes and LUT lookups
Hardware accelerator reaching 161 TOp/s/W energy efficiency at 3 nm (0.55 V)
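The decision-tree hash listed above is what makes the design multiplier-free: encoding an input subvector into a LUT bucket takes one threshold comparison per tree level. A minimal sketch, assuming a balanced tree with one learned split dimension per level (names and the exact parameterization are illustrative, not the paper's API):

```python
def tree_encode(x, split_dims, thresholds):
    """Encode vector x into a bucket index via a balanced binary decision
    tree of depth len(split_dims), using only comparisons (no multiplies).

    split_dims[level] is the feature compared at that level;
    thresholds[level][node] is the learned threshold for each of the
    2**level nodes at that level.
    """
    idx = 0
    for level, dim in enumerate(split_dims):
        bit = x[dim] > thresholds[level][idx]
        idx = 2 * idx + int(bit)  # descend left (0) or right (1)
    return idx  # bucket index in [0, 2**depth)
```

In hardware, each comparison is a single subtract-and-sign check, which is why replacing MACs with tree passes and LUT reads yields the reported efficiency gains.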