🤖 AI Summary
Existing DNN accelerators lack support for continual learning (CL) as they only enable forward propagation, whereas CL demands efficient backward propagation and dynamic weight updates to mitigate catastrophic forgetting—operations that incur substantial resource overhead. This work presents the first application-specific integrated circuit (ASIC) architecture tailored for CL in resource-constrained autonomous systems, integrating forward/backward propagation, online weight updates, and memory-aware CL task scheduling. Key contributions include: (1) custom processing and CL control units supporting the full CL pipeline; (2) a serpentine sliding-window convolution memory-access optimization; and (3) runtime-reconfigurable multiply-accumulate (MAC) units. Implemented in 65 nm CMOS, the chip occupies 4.74 mm² and consumes 86 mW. On CIFAR-10, it completes one training epoch in 1.76 s—58× faster than an NVIDIA Tesla P100 GPU.
📝 Abstract
The Continual Learning (CL) paradigm continuously evolves the parameters of a Deep Neural Network (DNN) model to progressively learn new tasks without degrading performance on previous ones, i.e., avoiding so-called catastrophic forgetting. However, updating DNN parameters in CL-based autonomous systems is extremely resource-hungry. Existing DNN accelerators cannot be directly employed for CL because they support only the execution of forward propagation. Only a few prior architectures execute backpropagation and weight updates, and they lack the control and management logic required for CL. To address this, we design TinyCL, a hardware architecture that performs CL on resource-constrained autonomous systems. It consists of a processing unit that executes both forward and backward propagation, and a control unit that manages the memory-based CL workload. To minimize memory accesses, the sliding window of the convolutional layer moves in a snake-like fashion. Moreover, the Multiply-and-Accumulate (MAC) units can be reconfigured at runtime to execute different operations. To the best of our knowledge, TinyCL is the first hardware accelerator that executes CL on autonomous systems. We synthesize the complete TinyCL architecture in a 65 nm CMOS technology node using the conventional ASIC design flow. It executes one training epoch of a Conv + ReLU + Dense model on the CIFAR-10 dataset in 1.76 s, whereas the same epoch takes 103 s on an NVIDIA Tesla P100 GPU, a 58× speedup, while consuming 86 mW in a 4.74 mm² die.
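The snake-like traversal of the convolutional sliding window can be illustrated with a short sketch: instead of restarting each output row from the left, the window reverses direction on alternate rows, so consecutive windows always overlap heavily and few new activations must be fetched from memory. This is a minimal Python sketch of the traversal order only; the function name, stride parameter, and square-window assumption are illustrative and not taken from the paper.

```python
def serpentine_positions(height, width, kernel, stride=1):
    """Yield (row, col) top-left corners of convolution windows in
    snake order: left-to-right on even output rows, right-to-left on
    odd ones. Consecutive windows then differ by a single stride step,
    maximizing data reuse between successive MAC operations."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    for r in range(out_h):
        cols = range(out_w) if r % 2 == 0 else range(out_w - 1, -1, -1)
        for c in cols:
            yield (r * stride, c * stride)

# Example: a 4x4 input with a 2x2 kernel and stride 1 gives a 3x3
# output grid; note the reversed middle row.
order = list(serpentine_positions(4, 4, 2))
# order == [(0, 0), (0, 1), (0, 2),
#           (1, 2), (1, 1), (1, 0),
#           (2, 0), (2, 1), (2, 2)]
```

In a raster-order traversal, the jump from the end of one row back to the start of the next shares no columns with the previous window; the serpentine order avoids exactly that worst-case refetch.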