🤖 AI Summary
This work investigates the geometric traces that pretraining and alignment leave in the weight space of Transformer models, and what causes them. Using subspace alignment analysis, gradient covariance modeling, optimization trajectory tracking, and rank-1 interventions, the study identifies an asymmetry between the read and write pathways: alignment updates concentrate along the dominant directions of attention-input activations in the read pathway ($W_Q$, $W_K$), while in the write pathway ($W_O$, $W_2$) they remain nearly isotropic relative to the prediction subspace. The authors propose an “anisotropic gradient accumulation” mechanism to explain this pattern and support it with a within-checkpoint trajectory, a graded contrastive-objective control, and a causal rank-1 intervention. These findings offer a geometric perspective and a theoretical basis for understanding how alignment reshapes pretrained models.
📝 Abstract
Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $\delta_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $\delta_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
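The accumulation argument can be illustrated with a small synthetic simulation. The sketch below is not the authors' probe or experimental setup: the dimensions, the single-spike activation model, the isotropic upstream gradients, the random reference subspace standing in for the prediction subspace, and the helper names `subspace_fraction` and `relative_fraction` are all assumptions made for illustration. It sums outer products $\delta_t a_t^\top$ and measures how much of the resulting delta's squared Frobenius norm falls in a chosen subspace, normalised by the fraction $k/d$ that an isotropic delta would give.

```python
import numpy as np

def subspace_fraction(delta_w, basis, side):
    """Fraction of ||delta_w||_F^2 lying in span(basis).

    side="input" projects the input (column) space of delta_w;
    side="output" projects its output (row) space.
    """
    q, _ = np.linalg.qr(basis)  # orthonormalise the basis columns
    proj = delta_w @ q if side == "input" else q.T @ delta_w
    return float(np.sum(proj ** 2) / np.sum(delta_w ** 2))

def relative_fraction(delta_w, basis, side):
    """Subspace fraction divided by the isotropic baseline k / d."""
    d = delta_w.shape[1] if side == "input" else delta_w.shape[0]
    k = basis.shape[1]
    return subspace_fraction(delta_w, basis, side) / (k / d)

rng = np.random.default_rng(0)
n_steps, d_out, d_in, spike = 2000, 64, 64, 5.0  # illustrative sizes (assumed)

# Assumed activation model: isotropic noise plus one high-variance direction,
# giving a spiked input covariance.
u = np.zeros(d_in)
u[0] = 1.0
a = rng.standard_normal((n_steps, d_in)) \
    + spike * rng.standard_normal((n_steps, 1)) * u

# Assumed upstream gradients: isotropic, so any structure in the accumulated
# delta must come from the activation side.
delta = rng.standard_normal((n_steps, d_out))

# Accumulated update: sum_t delta_t a_t^T, shape (d_out, d_in).
delta_w = delta.T @ a

# Top principal direction of the activations (input side).
_, _, vt = np.linalg.svd(a, full_matrices=False)
top_dir = vt[:1].T  # shape (d_in, 1)

# Random 4-dimensional reference subspace on the output side
# (a stand-in for a prediction subspace, not derived from any unembedding).
ref = rng.standard_normal((d_out, 4))

in_rel = relative_fraction(delta_w, top_dir, "input")
out_rel = relative_fraction(delta_w, ref, "output")
print(f"input-side relative fraction:  {in_rel:.2f}")
print(f"output-side relative fraction: {out_rel:.2f}")
```

Under these assumptions the input-side relative fraction should sit well above 1, while the output-side value stays near the isotropic baseline. Replacing the isotropic `delta` with gradients that have their own concentrated covariance would be the analogous toy version of the write-pathway case, where the anisotropy depends on the loss.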