Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers

📅 2024-06-05
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work investigates the division of labor between feed-forward and attention layers in Transformers with respect to knowledge storage and contextual reasoning. To this end, the authors design controlled synthetic tasks, including bigram modeling and context-dependent prediction, and combine a theoretical analysis of gradient dynamics with ablation experiments on the Pythia model family. The study provides joint empirical and theoretical evidence that feed-forward layers primarily capture distributional statistics of input tokens, whereas attention layers specialize in in-context reasoning, so that their removal severely impairs the handling of contextual dependencies. Crucially, the analysis identifies noise in the gradients as a key mechanism underlying this functional specialization. These findings offer both theoretical grounding and empirical validation for modular computation in large language models, advancing our understanding of how architectural components contribute to different aspects of language processing.
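The kind of controlled setting described above can be illustrated with a minimal data-generation sketch. Note that the exact construction below (the token values, the trigger mechanism, and the 20% trigger rate) is an illustrative assumption, not the paper's precise task: most next tokens follow a fixed bigram rule (a purely distributional association), but after a special trigger token the correct continuation is the first token of the sequence, which can only be predicted by in-context retrieval.

```python
import random

def make_sequence(vocab_size=16, length=12, trigger=0, seed=None):
    """Generate one toy sequence mixing a distributional rule with an
    in-context rule (illustrative construction, not the authors' exact task).

    - Bigram rule: token t is followed by (t mod (V-1)) + 1, a fixed
      association a feed-forward layer could memorize.
    - In-context rule: after the trigger token, the next token is the
      first token of the sequence, which requires attending to context.
    """
    rng = random.Random(seed)
    first = rng.randrange(1, vocab_size)  # first token is never the trigger
    seq = [first]
    for _ in range(length - 1):
        prev = seq[-1]
        if prev == trigger:
            seq.append(first)                          # in-context copy
        elif rng.random() < 0.2:
            seq.append(trigger)                        # occasionally insert trigger
        else:
            seq.append((prev % (vocab_size - 1)) + 1)  # bigram rule (avoids trigger)
    return seq
```

Training a small Transformer on sequences like these lets one test, by ablation, whether the bigram transitions are absorbed by feed-forward layers while the trigger-to-first-token rule relies on attention.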

📝 Abstract
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Distinguishing feed-forward and attention layers' roles in language models.
Examining distributional associations versus in-context reasoning in Transformers.
Analyzing gradient noise impact on layer-specific learning behaviors.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward layers learn simple distributional associations.
Attention layers focus on in-context reasoning tasks.
Gradient noise identified as key factor in discrepancy.
Lei Chen
Courant Institute of Mathematical Sciences, New York University

Joan Bruna
Professor of Computer Science, Data Science & Mathematics (aff), Courant Institute and CDS, NYU
Machine Learning

A. Bietti
Flatiron Institute