🤖 AI Summary
Existing activation steering methods entangle the angular and norm components of hidden-state representations, which makes their effects unstable and their mechanisms hard to interpret. This work decomposes interventions into angular and radial components and, through controlled experiments across seven language models, shows that conceptual information is encoded primarily in angular structure, while the norm remains important for intervention stability and downstream effects. The study explains the geometric origin of the performance differences among intervention approaches and advocates parameterizing interventions by interpretable angular and radial components, offering practical guidance for more stable and interpretable control of language model behavior.
📝 Abstract
Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
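The two geometric effects described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names (`angle_to`, `additive_steer`, `angular_steer`, `radial_steer`) and the toy vectors are hypothetical, chosen only to show how a single additive coefficient changes angle and norm together, while a rotation and a rescaling change each one separately.

```python
import numpy as np


def angle_to(h, v):
    """Angle in radians between hidden state h and concept direction v."""
    cos = np.dot(h, v) / (np.linalg.norm(h) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def additive_steer(h, v, alpha):
    """Additive steering h + alpha * v_hat: moves both the angle and the norm."""
    v_hat = v / np.linalg.norm(v)
    return h + alpha * v_hat


def angular_steer(h, v, theta):
    """Rotate h toward v by theta radians in span{h, v}, keeping the norm fixed."""
    r = np.linalg.norm(h)
    h_hat = h / r
    v_hat = v / np.linalg.norm(v)
    # Component of the concept direction orthogonal to h.
    u = v_hat - np.dot(v_hat, h_hat) * h_hat
    u_norm = np.linalg.norm(u)
    if u_norm < 1e-12:
        # h is (anti-)parallel to v, so the rotation plane is undefined.
        return h.copy()
    u = u / u_norm
    return r * (np.cos(theta) * h_hat + np.sin(theta) * u)


def radial_steer(h, scale):
    """Rescale the norm by a positive factor, leaving the direction unchanged."""
    return scale * h


# Toy example (assumed values, for illustration only).
h = np.array([3.0, 4.0])   # hidden state with norm 5
v = np.array([1.0, 0.0])   # concept direction

for name, h_new in [
    ("additive", additive_steer(h, v, alpha=2.0)),
    ("angular", angular_steer(h, v, theta=0.2)),
    ("radial", radial_steer(h, scale=2.0)),
]:
    print(name, "norm:", np.linalg.norm(h_new), "angle:", angle_to(h_new, v))
```

In this sketch the additive update shifts the angle and the norm at once, whereas `angular_steer` and `radial_steer` expose them as two separate knobs, which is the kind of parameterization the abstract argues for.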