Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing steering vector methods for large language models: they typically apply the intervention at a single fixed layer, which assumes the optimal intervention layer is the same for every input. The authors argue, with both theoretical and empirical support, that the optimal layer varies across inputs. They propose Where to Steer (W2S), a framework that learns a mapping from input embeddings to the optimal steering layer and selects the intervention layer per input. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines in both in-distribution and out-of-distribution settings, which the authors take as evidence that adaptive layer selection is a design dimension missing from current steering vector methods.
📝 Abstract
Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.
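The idea of input-dependent layer selection in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation. It assumes a hypothetical nearest-centroid selector (`NearestCentroidLayerSelector`), synthetic input embeddings, and per-input "best layer" labels that are simply given; the abstract does not say how W2S obtains such labels or what form its learned mapping takes. Hidden states are treated as independent arrays, whereas in a real transformer a shift at one layer would propagate to all later layers.

```python
import numpy as np


def steer(hidden_states, vectors, layer, alpha=1.0):
    """Add the steering vector for `layer` to that layer's hidden state only."""
    steered = [h.copy() for h in hidden_states]
    steered[layer] = steered[layer] + alpha * vectors[layer]
    return steered


class NearestCentroidLayerSelector:
    """Hypothetical selector: maps an input embedding to a layer index.

    Stands in for the learned mapping described in the abstract; the real
    W2S selector may be a different model entirely.
    """

    def fit(self, embeddings, best_layers):
        # One centroid per layer that was labeled "best" for some input.
        self.layers_ = np.unique(best_layers)
        self.centroids_ = np.stack(
            [embeddings[best_layers == l].mean(axis=0) for l in self.layers_]
        )
        return self

    def predict(self, embedding):
        # Pick the layer whose centroid is closest to this embedding.
        dists = np.linalg.norm(self.centroids_ - embedding, axis=1)
        return int(self.layers_[np.argmin(dists)])


rng = np.random.default_rng(0)
num_layers, dim = 4, 8

# Synthetic training set (assumption): two clusters of input embeddings,
# one whose best steering layer is 1 and one whose best layer is 3.
train_emb = np.concatenate(
    [rng.normal(-2.0, 0.1, (20, dim)), rng.normal(2.0, 0.1, (20, dim))]
)
train_best = np.array([1] * 20 + [3] * 20)
selector = NearestCentroidLayerSelector().fit(train_emb, train_best)

# One steering vector per layer, and placeholder hidden states for one input.
vectors = rng.normal(size=(num_layers, dim))
hidden = [np.zeros(dim) for _ in range(num_layers)]

query = np.full(dim, 2.0)          # embedding of a new input
layer = selector.predict(query)    # input-dependent layer choice
steered = steer(hidden, vectors, layer)
```

A fixed-layer baseline corresponds to calling `steer` with the same `layer` for every input; the only change in this sketch is that `layer` comes from `selector.predict` instead.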
Problem

Research questions and friction points this paper is trying to address.

LLM alignment
steering vectors
input-dependent
layer selection
representation modulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

input-dependent steering
adaptive layer selection
LLM alignment
steering vectors
representation intervention
Soham Gadgil
University of Washington
Chris Lin
University of Washington
Su-In Lee
Computer Science & Engineering, University of Washington
AI, ML, Computational biology & medicine