A Simple Generalisation of the Implicit Dynamics of In-Context Learning

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing in-context learning (ICL) theory is largely confined to toy models and oversimplified architectures, lacking fidelity to realistic Transformer designs. Method: We systematically generalize the theory of Dherin et al. (2025), which characterizes the implicit weight updates of a block's feedforward network via abstract Transformer blocks, to a substantially broader setting: all sequence positions (not just the final token), arbitrary Transformer blocks (not only the first), and greater architectural realism including LayerNorm and residual connections. Our approach combines implicit gradient-dynamics modeling, theoretical analysis of abstract blocks, and in-context linear regression experiments. Contribution/Results: We uncover coupled implicit weight-update mechanisms across tokens and blocks. Empirical validation shows strong agreement between theoretical predictions and observed behavior, improving the modeling fidelity and practical relevance of ICL theory. This work moves toward a structurally faithful dynamical framework for ICL with potential to scale to large language models.

📝 Abstract
In-context learning (ICL) refers to the ability of a model to learn new tasks from examples in its input without any parameter updates. In contrast to previous theories of ICL relying on toy models and data settings, recently it has been shown that an abstraction of a transformer block can be seen as implicitly updating the weights of its feedforward network according to the context (Dherin et al., 2025). Here, we provide a simple generalisation of this result for (i) all sequence positions beyond the last, (ii) any transformer block beyond the first, and (iii) more realistic residual blocks including layer normalisation. We empirically verify our theory on simple in-context linear regression tasks and investigate the relationship between the implicit updates related to different tokens within and between blocks. These results help to bring the theory of Dherin et al. (2025) even closer to practice, with potential for validation on large-scale models.
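The core identity underlying the abstraction can be sketched numerically. Assuming the rank-1 form of the implicit update reported by Dherin et al. (2025), a block whose feedforward layer with weights W acts on a token x plus its attention (context) contribution a is equivalent to the same layer with updated weights acting on x alone; the exact update form here is an illustrative assumption, not the paper's full statement.

```python
import numpy as np

# Minimal sketch (assumed rank-1 form) of the implicit weight update:
# applying W to (x + a), where a is the attention output for token x,
# equals applying W + dW to x alone, i.e. the context is folded into
# an implicit update of the feedforward weights.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))   # feedforward weight matrix (linear MLP for simplicity)
x = rng.standard_normal(d)        # token representation
a = rng.standard_normal(d)        # attention (context) contribution

# Choose dW so that dW @ x == W @ a, hence W @ (x + a) == (W + dW) @ x.
dW = np.outer(W @ a, x) / (x @ x)

lhs = W @ (x + a)                 # block output with explicit context
rhs = (W + dW) @ x                # context absorbed into the weights
print(np.allclose(lhs, rhs))      # True
```

The paper's generalisation extends this kind of equivalence beyond the last token and the first block, and to residual blocks with layer normalisation.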
Problem

Research questions and friction points this paper is trying to address.

Prior ICL theory relies on toy models and oversimplified architectures
Existing implicit-update results cover only the last token and the first block
Realistic components such as LayerNorm and residual connections are unmodeled
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizes implicit weight updates across all transformer blocks
Extends theory to all sequence positions beyond the last token
Applies to realistic residual blocks with layer normalization