SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking

📅 2024-07-05
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
To address the high energy consumption and computational overhead of large language model (LLM) inference, this work proposes the first biologically inspired spiking neural network (SNN) architecture tailored for LLMs. Methodologically, it introduces a saliency-based spiking trigger mechanism, integrates Generalized Integrate-and-Fire (GIF) neurons with an Optimal Brain Spiking allocation framework that compresses spike sequence length to approximately log₂ T bits, and combines spike sparsification and channel-wise sensitivity modeling with the OmniQuant and GPTQ quantization pipelines. Evaluated on LLaMA-7B (W4A4), the approach reduces WikiText2 perplexity by 11.01% and improves commonsense reasoning accuracy by 2.55%. Critically, GPTQ-quantized linear layers support direct additive spike computation, yielding substantial gains over existing partially binarized LLMs (PB-LLM) and easing key scalability bottlenecks in deploying SNNs at LLM scale.

📝 Abstract
Recent advancements in large language models (LLMs) with billions of parameters have improved performance in various applications, but their inference processes demand significant energy and computational resources. In contrast, the human brain, with approximately 86 billion neurons, is far more energy-efficient than LLMs with a comparable number of parameters. Inspired by this, we redesign $7 \sim 70$ billion parameter LLMs using bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose the first spiking large language model, SpikeLLM. Coupled with the proposed model, two essential approaches are proposed to improve spike training efficiency: Generalized Integrate-and-Fire (GIF) neurons to compress spike length from $T$ to $\frac{T}{L} \log_2 L$ bits, and an Optimal Brain Spiking framework to divide outlier channels and allocate different $T$ for GIF neurons, which further compresses spike length to approximately $\log_2 T$ bits. The necessity of spike-driven LLMs is demonstrated by comparison with quantized LLMs requiring similar operations. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 11.01% and improves commonsense reasoning accuracy by 2.55% on a LLaMA-7B W4A4 model. In the GPTQ pipeline, SpikeLLM achieves direct additive computation in linear layers, significantly exceeding PB-LLM.
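The compression claim in the abstract can be sketched numerically: a plain integrate-and-fire neuron rate-codes an activation as $T$ binary spikes ($T$ bits), while a GIF neuron emits graded spikes with $L$ levels over $T/L$ steps, each spike encodable in $\log_2 L$ bits, giving the quoted $\frac{T}{L}\log_2 L$ total. The snippet below is a minimal illustration of that counting argument, not the paper's implementation; the function names, reset rule, and level clipping are assumptions.

```python
def if_encode(x, T, vth=1.0):
    """Plain integrate-and-fire: rate-code activation x as T binary spikes."""
    mem, spikes = 0.0, []
    for _ in range(T):
        mem += x                     # integrate input each timestep
        s = 1.0 if mem >= vth else 0.0
        mem -= s * vth               # soft reset on firing
        spikes.append(s)
    return spikes                    # T spikes -> T bits

def gif_encode(x, T, L, vth=1.0):
    """Generalized IF (sketch): graded spikes in {0..L-1} over T//L steps.
    Each spike costs log2(L) bits, so total is (T/L)*log2(L) bits."""
    mem, spikes = 0.0, []
    for _ in range(T // L):
        mem += x * L                 # integrate L unit inputs per merged step
        s = min(int(mem // vth), L - 1)  # fire a graded (multi-level) spike
        mem -= s * vth
        spikes.append(s)
    return spikes                    # T/L spikes, each log2(L) bits
```

For example, x = 0.5 with T = 8 needs 8 binary spikes under plain IF, but only 2 graded spikes (4 bits at L = 4) under GIF, while both preserve the same firing rate.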
Problem

Research questions and friction points this paper is trying to address.

Redesign LLMs using bio-plausible spiking mechanisms for energy efficiency.
Propose SpikeLLM to improve spike training efficiency and reduce computational resources.
Demonstrate SpikeLLM's superiority over quantized LLMs in performance and energy efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking neural networks for energy-efficient LLMs
Generalized Integrate-and-Fire neurons compress spike length
Optimal Brain Spiking framework for efficient spike training
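The Optimal Brain Spiking idea — spend longer spike trains only on salient outlier channels — can be sketched as below. This is an illustrative stand-in, not the paper's method: the paper's saliency follows an Optimal Brain (second-order) criterion, whereas this sketch uses a mean-absolute-activation proxy, and the function name and step budgets are invented for the example.

```python
import numpy as np

def allocate_spike_steps(acts, base_T=1, outlier_T=4, outlier_frac=0.05):
    """Give the most salient channels a longer spike train (more timesteps T),
    keeping a short train everywhere else. `acts` is (samples, channels)."""
    saliency = np.abs(acts).mean(axis=0)           # per-channel saliency proxy
    k = max(1, int(outlier_frac * saliency.size))  # number of outlier channels
    outliers = np.argsort(saliency)[-k:]           # indices of top-k channels
    T = np.full(saliency.size, base_T)
    T[outliers] = outlier_T                        # longer train for outliers
    return T

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 100))
acts[:, 7] *= 50                 # plant one obvious outlier channel
T = allocate_spike_steps(acts)   # channel 7 receives the long spike train
```

The design point this illustrates: since total spike operations scale with the sum of per-channel timesteps, concentrating the budget on the few channels that dominate quantization error is what lets the average spike length shrink toward the claimed ~log₂T bits.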