Looped Transformers with Layer Normalization Provably Learn the Power Method

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study analyzes how a looped linear Transformer with layer normalization (LN), trained by gradient descent on principal component prediction, comes to implement the power method. The work reveals an “algorithmic implicit bias”: although the model is supervised only on the prediction target and never on the algorithm itself, gradient descent converges to a solution in which each self-attention layer performs one power iteration. The analysis also shows that LN is essential for realizing the exact power method: even with layerwise guidance from power iterations, a Transformer without LN cannot learn it exactly, which yields a provable performance gap in principal component prediction. According to the authors, this is the first theoretical analysis of the training dynamics of looped and single-layer Transformers with LN.

📝 Abstract

Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an "algorithmic implicit bias" of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.
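
For readers unfamiliar with the power method referenced in the abstract, the following is a minimal sketch of the classical algorithm in plain Python. It is not the paper's transformer construction: the function names, the 2x2 example matrix, and the use of Euclidean normalization as a stand-in for the normalization step are all illustrative assumptions. It only shows the structure that the abstract attributes to each looped layer, namely one matrix multiplication followed by a renormalization.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def power_method(matrix, num_steps, seed=0):
    """Run num_steps power iterations from a random unit start vector."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in matrix])
    for _ in range(num_steps):
        # One power iteration: multiply by the matrix, then renormalize.
        v = normalize(matvec(matrix, v))
    return v


# Hypothetical symmetric matrix with eigenvalues 3 and 1;
# its top eigenvector is (1, 1) / sqrt(2).
cov = [[2.0, 1.0], [1.0, 2.0]]
v = power_method(cov, num_steps=50)

# Absolute cosine similarity between v and the known top eigenvector.
alignment = abs(v[0] + v[1]) / math.sqrt(2.0)
print(alignment)
```

The loop applies the same update at every step, which mirrors the weight sharing of a looped model. Without the renormalization, the iterate's norm would grow or shrink geometrically with the top eigenvalue, which gives some intuition for why a normalization step matters when the update is repeated.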

Problem

Research questions and friction points this paper is trying to address.

Transformers

Layer Normalization

Power Method

Principal Component Prediction

Algorithmic Implicit Bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped Transformers

Layer Normalization

Power Method