Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers

📅 2024-06-05
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work investigates the division of labor between feed-forward and attention layers in Transformers with respect to knowledge storage and contextual reasoning. To this end, the authors design controlled synthetic tasks, including bigram modeling and context-dependent prediction, and combine a theoretical analysis of gradient dynamics with ablation experiments on the Pythia model family. The study provides joint empirical and theoretical evidence that feed-forward layers primarily capture distributional statistics of input tokens, whereas attention layers specialize in in-context reasoning, so that their removal severely impairs the handling of contextual dependencies. Crucially, the analysis identifies noise in the gradients as a key mechanism underlying this functional specialization. These findings offer both theoretical grounding and empirical validation for modular computation in large language models, advancing our understanding of how architectural components contribute to different aspects of language processing.
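The kind of controlled setting described above can be illustrated with a minimal data-generation sketch. Note that the exact construction below (the token values, the trigger mechanism, and the 20% trigger rate) is an illustrative assumption, not the paper's precise task: most next tokens follow a fixed bigram rule (a purely distributional association), but after a special trigger token the correct continuation is the first token of the sequence, which can only be predicted by in-context retrieval.

```python
import random

def make_sequence(vocab_size=16, length=12, trigger=0, seed=None):
    """Generate one toy sequence mixing a distributional rule with an
    in-context rule (illustrative construction, not the authors' exact task).

    - Bigram rule: token t is followed by (t mod (V-1)) + 1, a fixed
      association a feed-forward layer could memorize.
    - In-context rule: after the trigger token, the next token is the
      first token of the sequence, which requires attending to context.
    """
    rng = random.Random(seed)
    first = rng.randrange(1, vocab_size)  # first token is never the trigger
    seq = [first]
    for _ in range(length - 1):
        prev = seq[-1]
        if prev == trigger:
            seq.append(first)                          # in-context copy
        elif rng.random() < 0.2:
            seq.append(trigger)                        # occasionally insert trigger
        else:
            seq.append((prev % (vocab_size - 1)) + 1)  # bigram rule (avoids trigger)
    return seq
```

Training a small Transformer on sequences like these lets one test, by ablation, whether the bigram transitions are absorbed by feed-forward layers while the trigger-to-first-token rule relies on attention.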

📝 Abstract
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Distinguishing feed-forward and attention layers' roles in language models.
Examining distributional associations versus in-context reasoning in Transformers.
Analyzing gradient noise impact on layer-specific learning behaviors.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward layers learn simple distributional associations.
Attention layers focus on in-context reasoning tasks.
Gradient noise identified as key factor in discrepancy.
Lei Chen
Courant Institute of Mathematical Sciences, New York University

Joan Bruna
Professor of Computer Science, Data Science & Mathematics (aff), Courant Institute and CDS, NYU
Machine Learning

A. Bietti
Flatiron Institute