Patches of Nonlinearity: Instruction Vectors in Large Language Models

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how large language models internally represent and process instructions across two post-training stages: supervised fine-tuning (SFT) and direct preference optimization (DPO). Through causal mediation analysis, the authors find that instruction representations are highly localized in early network layers and introduce the concept of an "instruction vector": a representation that guides later layers to select task-relevant information pathways. Although these representations are linearly separable, their causal interaction with later computation is non-linear, challenging the prevailing assumption in mechanistic interpretability that internal representations act linearly. The authors therefore propose a novel method for identifying causal information pathways that does not rely on linearity, revealing the instruction vector's role as a selector of task-specific circuits within the model.

📝 Abstract
Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known about how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representations are fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability and non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.
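The causal mediation the abstract refers to is typically carried out via activation patching: cache a hidden state from a "clean" run and splice it into a "corrupted" run, then check whether the output recovers. The toy sketch below (not the authors' code; the three-layer tanh network and variable names are illustrative assumptions) shows the mechanics on a minimal non-linear model.

```python
import numpy as np

def run_layers(x, weights, patch_layer=None, patch_value=None):
    """Forward pass through toy non-linear layers, optionally overwriting
    one hidden state with a cached activation (activation patching)."""
    h = x
    cache = []
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)        # non-linear layer
        if i == patch_layer:
            h = patch_value       # causal intervention: splice in cached state
        cache.append(h.copy())
    return h, cache

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)) for _ in range(3)]

x_clean = np.ones(4)      # input with the "instruction" present
x_corrupt = np.zeros(4)   # input with the "instruction" ablated

out_clean, cache_clean = run_layers(x_clean, weights)
out_corrupt, _ = run_layers(x_corrupt, weights)

# Patch the clean layer-0 activation into the corrupted run: if the output
# moves toward the clean output, layer 0 causally mediates the behavior.
out_patched, _ = run_layers(x_corrupt, weights,
                            patch_layer=0, patch_value=cache_clean[0])

print(np.linalg.norm(out_patched - out_clean))    # recovery toward clean run
print(np.linalg.norm(out_patched - out_corrupt))  # size of causal effect
```

Interpreting patched-output differences as component importance implicitly assumes effects compose roughly linearly downstream; the paper's contribution is a localization method that drops that assumption.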
Problem

Research questions and friction points this paper is trying to address.

instruction representation
mechanistic interpretability
non-linear interaction
language models
post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction Vectors
Nonlinear Causal Interaction
Mechanistic Interpretability
Circuit Selection
Linear Representation Hypothesis
Irina Bigoulaeva
Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany
Jonas Rohweder
Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany
Subhabrata Dutta
TU Darmstadt
Machine Learning · Natural Language Processing · Computational Social Science
Iryna Gurevych
Full Professor, TU Darmstadt; Adjunct Professor, MBZUAI, UAE; Affiliated Professor, INSAIT, Bulgaria
Natural Language Processing · Large Language Models · Artificial Intelligence