🤖 AI Summary
Transformers exhibit severe degradation in generalization performance on sequence length extrapolation tasks, raising concerns about their reasoning capabilities. This work identifies, for the first time, a critical mechanism: the output variance of multi-head attention decays with increasing sequence length—termed “variance collapse”—leading to hidden-layer distribution shift and consequent generalization failure. Building upon this mechanistic analysis, we propose relocating LayerNorm to immediately after the multi-head attention module (“post-LN”), a simple, parameter-free, and plug-and-play architectural modification. Evaluated on long-range dependency benchmarks—including argmax retrieval and dictionary lookup—our approach substantially improves length extrapolation performance. These results confirm that variance decay is a fundamental bottleneck and demonstrate that post-LN effectively mitigates distribution shift. Our findings provide new insights into Transformer inductive biases and offer a principled direction for enhancing their extrapolation capabilities.
📝 Abstract
It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are genuine reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing-variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that, even for today's frontier models, a longer sequence length results in a decrease in the variance of the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
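The mechanism can be illustrated with a minimal sketch (not the paper's code, and assuming i.i.d. Gaussian queries, keys, and values): the softmax attention output is a convex combination of value rows, so as the sequence length grows, more rows are averaged and the output's variance shrinks. Normalizing that output, as the paper proposes, restores a length-independent scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (arbitrary choice for this demo)

def attention_output_var(seq_len, trials=200):
    """Average per-coordinate variance of a single-query attention output."""
    variances = []
    for _ in range(trials):
        q = rng.standard_normal(d)
        K = rng.standard_normal((seq_len, d))
        V = rng.standard_normal((seq_len, d))
        scores = K @ q / np.sqrt(d)          # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # softmax weights
        out = w @ V                           # convex combination of value rows
        variances.append(out.var())
    return float(np.mean(variances))

def layer_norm(x, eps=1e-6):
    """Plain layer normalization (no learned gain/bias)."""
    return (x - x.mean()) / (x.std() + eps)

v_short = attention_output_var(16)
v_long = attention_output_var(1024)
# v_long comes out much smaller than v_short: the longer the sequence,
# the closer the softmax average is to a mean over many i.i.d. rows.
# After layer_norm, the output has unit variance at any length.
```

Under these i.i.d. assumptions the attention weights concentrate only mildly, so the output variance scales roughly as the inverse of the effective number of attended tokens, which grows with sequence length; this is the distribution shift that normalizing after the attention output suppresses.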