Dynamic Short Convolutions Improve Transformers

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes dynamic short convolutions as an additional neural network primitive for Transformers, aiming to increase expressivity in language modeling while preserving the locality bias of convolution. The approach uses input-dependent convolutional filters applied to the key, query, and value representations and, in a stronger variant, after every linear layer. Across language-modeling experiments from 150M to 2B parameters, it consistently outperforms standard Transformers and Transformers augmented with static short convolutions. Fitted scaling laws indicate a 1.33× compute advantage over compute-matched Transformers for the key/query/value variant and a 1.60× advantage for the every-linear-layer variant. Custom Triton kernels keep training efficient, at the cost of a manageable end-to-end slowdown. The method also improves linear RNNs (Mamba-2, Gated DeltaNet) and Mixture-of-Experts (MoE) architectures.

📝 Abstract

Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations improves performance on challenging associative recall tasks compared with static convolutional variants. Across language-modeling experiments ranging from 150M to 2B parameters, dynamic convolutions consistently outperform standard Transformers and Transformers augmented with static short convolutions. Fitting scaling laws indicates a 1.33× compute advantage over compute-matched Transformers when dynamic convolutions are applied to the key, query, and value vectors, and a 1.60× advantage when adding dynamic convolutions after every linear layer. Dynamic convolutions also offer improvements on linear RNNs (Mamba-2/Gated DeltaNet) and mixture-of-experts architectures. We make these gains practical with custom Triton kernels that enable efficient training with a manageable end-to-end slowdown. These results suggest that dynamic short convolutions are a scalable, hardware-efficient, and expressive primitive for advancing Transformer-based language models.
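
To make the static/dynamic distinction in the abstract concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes a causal, depthwise convolution with a short kernel of K taps, and it assumes the dynamic filters come from a single linear projection of the current token. The abstract does not specify the filter parameterization, so `W_gen`, `b_gen`, and both function names are hypothetical.

```python
import numpy as np


def static_short_conv(x, w):
    """Causal depthwise convolution with one fixed filter for all positions.

    x: (T, D) sequence of T token vectors; w: (K, D) filter taps.
    Computes y[t] = sum_j w[j] * x[t - j], treating x[t] as zero for t < 0.
    """
    T, D = x.shape
    K = w.shape[0]
    # Left-pad with K-1 zero rows so x_pad[K - 1 + t] == x[t].
    x_pad = np.concatenate([np.zeros((K - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for j in range(K):
        # Tap j reads the input j steps in the past.
        y += w[j] * x_pad[K - 1 - j : K - 1 - j + T]
    return y


def dynamic_short_conv(x, W_gen, b_gen):
    """Causal depthwise convolution whose filter depends on the input.

    ASSUMPTION: the filter at position t is a linear function of x[t].
    x: (T, D); W_gen: (D, K * D); b_gen: (K * D,).
    Computes y[t] = sum_j filters[t, j] * x[t - j].
    """
    T, D = x.shape
    K = b_gen.shape[0] // D
    # One (K, D) filter per position, generated from that position's token.
    filters = (x @ W_gen + b_gen).reshape(T, K, D)
    x_pad = np.concatenate([np.zeros((K - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for j in range(K):
        y += filters[:, j, :] * x_pad[K - 1 - j : K - 1 - j + T]
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, K = 8, 4, 4
    x = rng.standard_normal((T, D))
    W_gen = 0.1 * rng.standard_normal((D, K * D))
    b_gen = rng.standard_normal(K * D)
    print(dynamic_short_conv(x, W_gen, b_gen).shape)
```

In this sketch, setting `W_gen` to zero makes the filter independent of the input, so the dynamic version reduces to the static one with `w = b_gen.reshape(K, D)`. Both versions read only the current and previous K-1 positions, which is the locality bias the abstract refers to.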

Problem

Research questions and friction points this paper is trying to address.

Transformers

dynamic short convolutions

language modeling

computational efficiency

neural network primitives

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic short convolutions

Transformer enhancement

input-dependent filters