🤖 AI Summary
Transformers exhibit severe degradation in generalization performance on sequence length extrapolation tasks, raising concerns about their reasoning capabilities. This work identifies, for the first time, a critical mechanism: the output variance of multi-head attention decays with increasing sequence length—termed “variance collapse”—leading to hidden-layer distribution shift and consequent generalization failure. Building upon this mechanistic analysis, we propose relocating LayerNorm to immediately after the multi-head attention module (“post-LN”), a simple, parameter-free, and plug-and-play architectural modification. Evaluated on long-range dependency benchmarks—including argmax retrieval and dictionary lookup—our approach substantially improves length extrapolation performance. These results confirm that variance decay is a fundamental bottleneck and demonstrate that post-LN effectively mitigates distribution shift. Our findings provide new insights into Transformer inductive biases and offer a principled direction for enhancing their extrapolation capabilities.
📝 Abstract
It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are genuine reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing-variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that, even for today's frontier models, a longer sequence length results in a decrease in the variance of the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
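The mechanism can be illustrated with a minimal sketch (not the paper's code, and assuming i.i.d. Gaussian queries, keys, and values): the softmax attention output is a convex combination of value rows, so as the sequence length grows, more rows are averaged and the output's variance shrinks. Normalizing that output, as the paper proposes, restores a length-independent scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (arbitrary choice for this demo)

def attention_output_var(seq_len, trials=200):
    """Average per-coordinate variance of a single-query attention output."""
    variances = []
    for _ in range(trials):
        q = rng.standard_normal(d)
        K = rng.standard_normal((seq_len, d))
        V = rng.standard_normal((seq_len, d))
        scores = K @ q / np.sqrt(d)          # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # softmax weights
        out = w @ V                           # convex combination of value rows
        variances.append(out.var())
    return float(np.mean(variances))

def layer_norm(x, eps=1e-6):
    """Plain layer normalization (no learned gain/bias)."""
    return (x - x.mean()) / (x.std() + eps)

v_short = attention_output_var(16)
v_long = attention_output_var(1024)
# v_long comes out much smaller than v_short: the longer the sequence,
# the closer the softmax average is to a mean over many i.i.d. rows.
# After layer_norm, the output has unit variance at any length.
```

Under these i.i.d. assumptions the attention weights concentrate only mildly, so the output variance scales roughly as the inverse of the effective number of attended tokens, which grows with sequence length; this is the distribution shift that normalizing after the attention output suppresses.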