🤖 AI Summary
This study investigates why copy subcircuits, such as the one in the first layer of an induction head, emerge abruptly in Transformer attention during training. Working in a Bayesian framework, the authors model feature learning in a single-layer softmax attention network trained on a copy task. They derive a closed-form posterior over the attention matrix and project it onto a low-dimensional order-parameter space, which reveals a phase transition driven by the amount of training data. The result is a first-principles account of how copying behavior emerges: softmax attention exhibits a first-order phase transition, whereas linear attention undergoes a second-order phase transition followed by a smooth crossover toward the structured attention pattern. The predictions are verified with both Bayesian sampling and standard training with Adam, and the abrupt emergence resembles what is observed when training large language models.
📝 Abstract
Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a first-order phase transition while in linear attention an initial second-order phase transition is followed by a smooth, continuous evolution toward the structured attention pattern (crossover). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training large language models.
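To make the setup concrete, the following is a minimal sketch of a single attention layer acting on a toy copy task, contrasting softmax and linear attention weights. It is not the authors' model: the position-only score matrix, the "copy the previous token" target, and the names `copy_scores` and `copy_overlap` are all assumptions introduced here for illustration, and `copy_overlap` is only a stand-in for the kind of low-dimensional order parameter the abstract refers to.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(token, vocab_size):
    """Encode a token id as a one-hot list of floats."""
    return [1.0 if i == token else 0.0 for i in range(vocab_size)]


def copy_scores(length, strength):
    """Hypothetical structured score matrix (an assumption of this sketch):
    query position t puts `strength` on source position t-1, and
    position 0 points at itself."""
    return [
        [strength if s == max(t - 1, 0) else 0.0 for s in range(length)]
        for t in range(length)
    ]


def attention_weights(scores, kind="softmax"):
    """Turn scores into mixing weights: a row-wise softmax, or the raw
    scores themselves for the linear variant."""
    if kind == "softmax":
        return [softmax(row) for row in scores]
    return [list(row) for row in scores]


def attend(scores, values, kind="softmax"):
    """Mix the value vectors using the attention weights."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in attention_weights(scores, kind)
    ]


def copy_overlap(scores, kind="softmax"):
    """Illustrative order parameter (hypothetical): the mean attention
    weight placed on the correct source position."""
    weights = attention_weights(scores, kind)
    return sum(row[max(t - 1, 0)] for t, row in enumerate(weights)) / len(weights)


tokens = [2, 0, 1, 3]
values = [one_hot(tok, 4) for tok in tokens]

unstructured = copy_overlap(copy_scores(4, 0.0))   # flat scores: uniform attention
structured = copy_overlap(copy_scores(4, 10.0))    # peaked scores: near-copy attention
linear_out = attend(copy_scores(4, 1.0), values, kind="linear")
```

In this sketch the scores are set by hand, so it only shows the two endpoints (unstructured versus copy-like attention) and how an overlap-style quantity tells them apart. It does not reproduce the transition itself, which in the abstract arises from the posterior over a learned attention matrix as the amount of training data grows.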