Winner-Take-All Spiking Transformer for Language Modeling

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high energy consumption and hardware incompatibility of existing spiking Transformers for language modeling, which rely on softmax-based self-attention. To overcome these limitations, the authors introduce a Winner-Take-All (WTA) mechanism and design two fully spike-driven, softmax-free self-attention modules, WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA), which they integrate into end-to-end trainable encoder-only and decoder-only spiking Transformers for masked and causal language modeling, respectively. The paper presents this as the first systematic exploration of softmax-free, spike-driven language modeling, reporting strong performance across 16 benchmarks spanning natural language understanding, question answering, and commonsense reasoning. The results highlight substantial gains in energy efficiency and underscore the potential of spiking Transformers for general-purpose language modeling and deployment on neuromorphic hardware.
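
As a rough illustration of the core idea, the sketch below replaces the softmax row normalization of an attention score matrix with a hard top-k Winner-Take-All selection. The k-WTA rule, the function name `wta_select`, and the use of PyTorch are assumptions made for illustration only; they do not reproduce the paper's exact WSSA formulation.

```python
# Hedged sketch: hard top-k Winner-Take-All over attention scores, as a
# softmax-free alternative to row normalization. The k-WTA choice and the
# function name are illustrative assumptions, not the paper's formulation.
import torch

def wta_select(scores: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Return a binary mask keeping only the k largest scores per query row."""
    k = min(k, scores.size(-1))
    winners = scores.topk(k, dim=-1).indices          # indices of the winning keys
    return torch.zeros_like(scores).scatter_(-1, winners, 1.0)
```

Because the output is a binary selection mask rather than a normalized probability distribution, the subsequent weighting of values stays additive and spike-compatible, which is what makes a softmax-free design attractive for neuromorphic hardware.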

📝 Abstract
Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient properties of Spiking Neural Networks (SNNs), have achieved impressive results on neuromorphic and vision tasks and attracted increasing attention. However, existing directly trained spiking Transformers focus primarily on vision tasks. For language modeling with spiking Transformers, convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment. To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking Transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Based on them, we design a WTA-based Encoder-only Spiking Transformer (WE-Spikingformer) for masked language modeling and a WTA-based Decoder-only Spiking Transformer (WD-Spikingformer) for causal language modeling, systematically exploring softmax-free, spike-driven Transformer architectures trained end-to-end for natural language processing tasks. Extensive experiments on 16 datasets spanning natural language understanding, question answering, and commonsense reasoning validate the effectiveness of our approach and highlight the promise of spiking Transformers for general language modeling and energy-efficient artificial intelligence.
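
For concreteness, here is a minimal sketch of how a softmax-free, WTA-based spiking self-attention layer might look in PyTorch. The class name, the thresholded spike function, the hard top-k WTA rule, and the `causal` flag are all illustrative assumptions; they do not reproduce the exact WSSA/CWSSA modules or the spiking neuron model used in the paper.

```python
import torch
import torch.nn as nn

def spike(x: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Emit a binary spike wherever the pre-activation crosses the threshold (assumed neuron model)."""
    return (x >= threshold).float()

class WTASpikingSelfAttention(nn.Module):
    """Softmax-free self-attention sketch: binary Q/K/V spikes, hard top-k WTA instead of softmax."""

    def __init__(self, dim: int, heads: int = 8, k: int = 4, causal: bool = False):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by heads"
        self.heads, self.k, self.causal = heads, k, causal
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h, hd = self.heads, d // self.heads
        # Project inputs and binarize into spike trains.
        q = spike(self.q_proj(x)).view(b, t, h, hd).transpose(1, 2)
        k = spike(self.k_proj(x)).view(b, t, h, hd).transpose(1, 2)
        v = spike(self.v_proj(x)).view(b, t, h, hd).transpose(1, 2)
        # Binary Q/K give integer-valued scores; no softmax normalization is applied.
        scores = q @ k.transpose(-2, -1)                                  # (b, h, t, t)
        if self.causal:
            future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
            scores = scores.masked_fill(future, float("-inf"))
        # Winner-Take-All: each query attends only to its k strongest keys.
        winners = scores.topk(min(self.k, t), dim=-1).indices
        wta = torch.zeros_like(scores).scatter_(-1, winners, 1.0)
        if self.causal:
            wta = wta.masked_fill(future, 0.0)                            # never select future positions
        out = (wta @ v).transpose(1, 2).reshape(b, t, d)                  # spike-driven aggregation
        return self.out_proj(out)
```

Training such a module end-to-end would additionally require surrogate gradients for the thresholding step; the sketch only illustrates why replacing softmax with WTA keeps the attention path accumulation-based and spike-compatible.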
Problem

Research questions and friction points this paper is trying to address.

Spiking Transformers
language modeling
softmax-based self-attention
energy efficiency
neuromorphic deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking Transformer
Winner-Take-All
softmax-free
spike-driven attention
energy-efficient NLP