Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

📅 2024-06-13
🏛️ Neural Information Processing Systems
📈 Citations: 8
Influential: 0
🤖 AI Summary
This work investigates how information propagates between layers in Transformer language models, focusing on how features are written into, routed through, and read out of low-rank subspaces of the residual stream, forming cross-layer communication channels. The authors identify and empirically validate a "position-indexed 3D subspace" in GPT-2's activations, and show that "contextual item crowding" in this limited subspace explains the model's otherwise arbitrary-seeming sensitivity to the order of items in a list. Methodologically, the paper combines singular value decomposition (SVD) of attention-head weights, residual-stream subspace analysis, low-rank feature tracking, and intervention experiments on a synthetic task (Laundry List). Grounded in this subspace structure, the authors perform interpretable weight editing and representation interventions that improve Laundry List accuracy by over 20%, predict cross-layer attention-head interactions from weight matrices alone, and provide faithful, mechanistic attributions for model failures.
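The "low-rank communication channel" idea above can be illustrated with a minimal numpy sketch: an early layer writes a feature into the span of a small orthonormal basis in the residual stream, and a later layer recovers it by projection. The basis `B` and feature `f` here are random stand-ins, not the actual subspaces the paper finds in GPT-2.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, k = 64, 3

# Hypothetical rank-k channel: orthonormal basis B spanning a small
# subspace of the residual stream (stand-in for the paper's 3D
# position-index subspace in GPT-2).
B, _ = np.linalg.qr(rng.normal(size=(d_model, k)))

f = np.array([0.5, -1.0, 2.0])       # k-dim feature to communicate
resid = rng.normal(size=d_model)     # residual stream before the write
resid_after = resid + B @ f          # "write": add feature into span(B)

f_read = B.T @ resid_after           # "read": project onto the channel
# f_read equals f plus whatever the rest of the stream leaks into
# span(B); crowding many items into a low-rank channel makes that
# interference worse, which is the failure mode the paper describes.
```

The read-out is exact up to the leakage term `B.T @ resid`, which is why a 3D channel can only cleanly index a handful of items before they "crowd" each other.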

📝 Abstract
Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task, and find that it underlies a commonly used abstraction across many context-retrieval behaviors. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers, forming low-rank communication channels (Elhage et al., 2021) between layers. A particular 3D subspace in model activations in GPT-2 can be traversed to positionally index items in lists, and we show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. That is, the model has trouble copying the correct information from context when many items "crowd" this limited space. By decomposing attention heads with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices alone. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.
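The abstract's claim that cross-layer head interactions can be predicted from weight matrices alone can be sketched as a composition-score-style check (in the spirit of Elhage et al., 2021): take the SVD of an early head's output-value circuit to find the subspace it writes, then measure how much of a later head's query-key circuit reads from that subspace. The weight matrices below are random stand-ins, not GPT-2's; only the procedure is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k = 64, 3

# Hypothetical head weights (the paper analyzes GPT-2's actual
# W_OV and W_QK circuits; these are random placeholders).
W_OV_early = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_QK_late = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# SVD of the early head's OV circuit: its top-k left singular vectors
# span the low-rank subspace it writes into the residual stream.
U, S, Vt = np.linalg.svd(W_OV_early)
write_basis = U[:, :k]                      # d_model x k, orthonormal

# Fraction of the later head's QK circuit norm that survives when its
# input is restricted to the early head's write subspace: a high score
# suggests the later head reads from that channel.
proj = write_basis @ write_basis.T          # projector onto the subspace
score = np.linalg.norm(W_QK_late @ proj) / np.linalg.norm(W_QK_late)
print(f"composition score: {score:.3f}")
```

For unrelated random matrices this score hovers near the chance level `sqrt(k / d_model)`; scores well above chance between real heads are the weight-only evidence of a communication channel.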
Problem

Research questions and friction points this paper is trying to address.

Understanding inter-layer communication in transformer language models.
Analyzing how models represent and route information between layers.
Improving model performance on context-retrieval tasks using discovered mechanisms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank subspaces for inter-layer communication
SVD of attention-head weights predicts cross-layer head interactions
Model weight editing improves task accuracy