🤖 AI Summary
This study asks which local linear structures actually exist in large language models, and at what scale, critically examining the "fixed task plane" hypothesis. Analyzing weight and activation dynamics in a synthetic multitask Transformer and in LoRA fine-tuning, the authors find that task gradients are strongly low-rank locally, but that the useful basis is not static: it drifts substantially within 100 steps. Building on this observation, they propose trajectory-prefix bases, built from the first recovery updates, to capture recovery directions, and develop a Gaussian local-linear theorem that justifies random parameter search in high dimensions. Across experiments involving DistilGPT-2, GPT-2, and Qwen-0.5B, a trajectory-prefix basis captures 77% of the LoRA recovery displacement, and the activation shift produced by a single gradient step has a cosine similarity of 0.58 with a labelled-contrast CAA steering vector. The results suggest that linear structures in trained networks are evolving local geometries rather than global task directions.
📝 Abstract
Task vectors, LoRA, activation steering, and random search around pretrained weights all suggest that learned behaviour can be controlled by linear directions. We ask which linear structures actually exist and on what scale. In a synthetic multitask Transformer and in LoRA adapters on DistilGPT-2 / GPT-2, we find strong local low-rank task-gradient structure but reject the fixed-task-plane hypothesis: static bases miss the recovery direction, and the useful basis drifts substantially within 100 steps. However, the first recovery updates form a trajectory-prefix basis capturing 77% of the LoRA recovery displacement. We develop a random-search theory, including a Gaussian local-linear theorem, that justifies the effectiveness of random parameter search even in very high dimensions. We also study the relation between parameter perturbations and activation steering: a single gradient step produces an activation shift with a cosine similarity of 0.58 to a labelled-contrast CAA steering vector, with a similar steering effect on Qwen-0.5B BoolQ statements. We validate our results with experiments on synthetic Transformers and LLMs. Our results suggest that linear structures in trained networks are not global task directions, but evolving local geometries that partially persist across parameter and activation spaces.
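To make the two headline quantities concrete, here is a minimal NumPy sketch of how a "fraction of displacement captured by a basis" and a cosine similarity between two direction vectors could be computed. It is an illustration under stated assumptions, not the authors' code: the function names `captured_fraction` and `cosine` are hypothetical, and the capture metric used here (squared norm of the projection divided by squared norm of the displacement) is an assumed definition, since the abstract does not say how the 77% figure is measured.

```python
import numpy as np


def captured_fraction(prefix_updates, displacement, tol=1e-10):
    """Share of the squared norm of `displacement` lying in the span of the
    rows of `prefix_updates` (assumed metric; not taken from the paper)."""
    A = np.asarray(prefix_updates, dtype=float)   # shape (k, dim): first k updates
    d = np.asarray(displacement, dtype=float)     # shape (dim,), assumed nonzero
    # Orthonormal basis for the span of the updates via SVD, dropping
    # numerically zero directions so duplicated updates are not counted twice.
    U, s, _ = np.linalg.svd(A.T, full_matrices=False)
    keep = s > tol * s.max() if s.size and s.max() > 0 else np.zeros_like(s, dtype=bool)
    Q = U[:, keep]
    proj = Q @ (Q.T @ d)                          # orthogonal projection onto the span
    return float(proj @ proj / (d @ d))


def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Synthetic stand-ins: three early "recovery updates" in a 50-dim space,
    # and a total displacement that mostly, but not entirely, lies in their span.
    rng = np.random.default_rng(0)
    updates = rng.normal(size=(3, 50))
    total = updates.sum(axis=0) + 0.5 * rng.normal(size=50)
    print("captured fraction:", captured_fraction(updates, total))
    print("cosine to first update:", cosine(total, updates[0]))
```

In this framing, a static basis would be computed once and reused, whereas a trajectory-prefix basis is rebuilt from the first few updates of each recovery run; the same `captured_fraction` call applies to either, which is what makes the two comparable.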