Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the theoretical advantages and empirical efficacy of replacing softmax self-attention with sigmoid self-attention in Transformers. To address the token-level over-competition and poor sample efficiency inherent to softmax normalization, the authors introduce a Mixture-of-Experts (MoE)-inspired framework for modeling the rows of the self-attention matrix, giving the first rigorous analysis of self-attention from an MoE perspective. They formally prove that sigmoid self-attention achieves superior sample complexity and tighter generalization error bounds than softmax, requiring significantly fewer samples for consistent estimation. Experiments on synthetic benchmarks and multi-task real-world scenarios confirm faster convergence, improved generalization, and better preservation of salient features. The core contribution is reframing self-attention through the MoE paradigm, moving beyond the competitive normalization that softmax imposes. This yields both a new theoretical foundation and practical design principles for efficient, lightweight Transformer architectures.
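The MoE framing described above can be illustrated with a minimal sketch (illustrative only, not the paper's formulation): a single row of softmax self-attention is a gated convex combination of value vectors, where the values play the role of "experts" and the softmax scores act as gating weights.

```python
import numpy as np

# One row of the attention matrix viewed as an MoE gate:
# the query's scores over the keys become gating weights,
# and the value vectors play the role of "experts".
rng = np.random.default_rng(1)
d = 8
q = rng.standard_normal(d)       # one query token
K = rng.standard_normal((5, d))  # keys for 5 tokens
V = rng.standard_normal((5, d))  # values = the 5 "experts"

scores = K @ q / np.sqrt(d)           # scaled dot-product scores
gate = np.exp(scores - scores.max())  # softmax gating (stable)
gate /= gate.sum()                    # gating weights sum to 1

out = gate @ V  # mixture-of-experts combination of the values
```

Because the gating weights are normalized to sum to one, increasing the weight on one "expert" necessarily decreases the weights on the others, which is the token competition the paper analyzes.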

📝 Abstract
At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the literature. This paper closes this gap by theoretically demonstrating that sigmoid self-attention is more sample-efficient than its softmax counterpart. Toward that goal, we illustrate that each row of the self-attention matrix can be represented as a mixture of experts. Our analysis shows that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention. We corroborate our theoretical findings through extensive experiments on both synthetic and real-world datasets.
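The contrast between the two mechanisms can be sketched in a few lines (a minimal illustration, not the paper's implementation): softmax normalizes each row of the score matrix so the weights compete for a fixed budget, while element-wise sigmoid scores each query-key pair independently.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Row-wise softmax: weights in each row sum to 1,
    # so raising one token's weight lowers the others'.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sigmoid_attention(Q, K, V):
    # Element-wise sigmoid: each weight lies in (0, 1) independently,
    # with no row-wise normalization and hence no token competition.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = 1.0 / (1.0 + np.exp(-scores))
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out_soft = softmax_attention(Q, K, V)
out_sig = sigmoid_attention(Q, K, V)
```

Note that the sigmoid variant needs no cross-token reduction over each row, which is the source of the computational savings the abstract mentions.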
Problem

Research questions and friction points this paper is trying to address.

Transformer Architecture
Sigmoid vs Softmax
Attention Mechanism Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sigmoid Self-Attention
Computational Efficiency
Mixture of Experts