Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

📅 2026-04-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing steering-vector methods for large language models typically intervene at a single, globally fixed layer, which implicitly assumes that the optimal intervention layer is the same for every input. The authors argue, both theoretically and empirically, that the optimal layer varies with the input. They propose Where to Steer (W2S), a framework that learns a mapping from input embeddings to the optimal steering layer and selects the intervention layer per input. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines in both in-distribution and out-of-distribution settings.

📝 Abstract

Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.
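The core idea in the abstract, mapping an input embedding to a steering layer and then intervening only at that layer, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say what form the learned mapping takes, so a nearest-centroid selector stands in for it here, and the per-input "best layer" labels are assumed to be given. The function names, the toy residual forward pass, and all data values are hypothetical.

```python
import math


def fit_layer_selector(embeddings, best_layers):
    """Nearest-centroid selector: map each observed layer to its mean embedding.

    `best_layers[i]` is the (assumed, precomputed) best steering layer for
    training input i. This stands in for the learned mapping in the abstract.
    """
    groups = {}
    for emb, layer in zip(embeddings, best_layers):
        groups.setdefault(layer, []).append(emb)
    return {
        layer: [sum(col) / len(members) for col in zip(*members)]
        for layer, members in groups.items()
    }


def select_layer(centroids, embedding):
    """Return the layer whose centroid is closest (Euclidean) to the embedding."""
    return min(centroids, key=lambda layer: math.dist(centroids[layer], embedding))


def forward_with_steering(x, weights, layer, vector, alpha=1.0):
    """Toy residual forward pass; adds alpha * vector after the chosen layer."""
    h = list(x)
    for i, w in enumerate(weights):
        # Toy residual block: h <- h + tanh(h @ w), with w a square matrix.
        update = [
            math.tanh(sum(h[j] * w[j][k] for j in range(len(h))))
            for k in range(len(h))
        ]
        h = [a + b for a, b in zip(h, update)]
        if i == layer:
            # Steering: shift the hidden state along the steering vector.
            h = [a + alpha * v for a, v in zip(h, vector)]
    return h


# Hypothetical toy data: 2-D "embeddings" labelled with their best layer.
train_emb = [[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]]
train_best = [0, 0, 2, 2]
centroids = fit_layer_selector(train_emb, train_best)

# Hypothetical 3-layer toy model and steering vector.
weights = [[[0.1, -0.05], [0.02, 0.08]] for _ in range(3)]
steering_vector = [1.0, -1.0]

query = [4.5, 0.2]
chosen = select_layer(centroids, query)  # input-dependent layer choice
steered = forward_with_steering(query, weights, chosen, steering_vector)
```

A fixed-layer baseline corresponds to passing the same `layer` for every query. The input-dependent variant differs only in calling `select_layer` first, which is why the selector can be kept lightweight.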

Problem

Research questions and friction points this paper is trying to address.

LLM alignment

steering vectors

input-dependent

layer selection

representation modulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

input-dependent steering

adaptive layer selection

LLM alignment

steering vectors

representation intervention

