Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

📅 2026-03-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates over-refusal in aligned language models: the tendency to incorrectly reject safe instructions that superficially resemble harmful ones. It is the first to distinguish harmful refusal from over-refusal through the lens of representation subspaces. Harmful refusal exhibits task-agnostic structure and can be captured by a single global direction, whereas over-refusal resides within high-dimensional, task-specific clusters of benign representations. Through geometric analysis of representations, linear probing, and targeted interventions on hidden states, the study shows that the two refusal types are already distinguishable in the early Transformer layers, and that ablating the global refusal direction fails to mitigate over-refusal. These findings expose a fundamental geometric distinction between the two types of refusal behavior and lay a theoretical foundation for more precise, task-adaptive alignment strategies.
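
To make the "single global direction" versus "high-dimensional subspace" contrast concrete, the sketch below counts how many principal components are needed to explain most of the variance in a set of per-task direction vectors. It is an illustration on synthetic data, not the paper's procedure: the helper name `n_components_for_variance`, the 90% threshold, and the toy direction matrices are all assumptions introduced here.

```python
import numpy as np

def n_components_for_variance(directions, threshold=0.9):
    """Smallest number of singular components whose squared singular
    values account for at least `threshold` of the total (uncentered)."""
    s = np.linalg.svd(directions, compute_uv=False)
    explained = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)

def unit_rows(x):
    """Normalise each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_tasks, dim = 6, 32  # hypothetical sizes, chosen only for the demo

# Toy "harmful-refusal" directions: every task shares one vector plus small noise.
shared = np.zeros(dim)
shared[0] = 1.0
harmful_dirs = unit_rows(shared + 0.02 * rng.normal(size=(n_tasks, dim)))

# Toy "over-refusal" directions: each task points along its own axis plus small noise.
task_axes = np.eye(dim)[1:n_tasks + 1]
over_refusal_dirs = unit_rows(task_axes + 0.02 * rng.normal(size=(n_tasks, dim)))

k_harmful = n_components_for_variance(harmful_dirs)
k_over = n_components_for_variance(over_refusal_dirs)
print("components needed (harmful):", k_harmful)
print("components needed (over-refusal):", k_over)
```

The directions are deliberately left uncentered, so a set of vectors that all point the same way collapses onto one component instead of having its shared mean subtracted away.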

📝 Abstract

Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that superficially resemble harmful ones. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers onwards. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
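
For readers unfamiliar with global direction ablation, the following is a minimal sketch under stated assumptions. The abstract does not say how the global refusal direction is estimated; a difference of class means is assumed here as one common choice. The function names `refusal_direction` and `ablate`, and the synthetic hidden states, are hypothetical.

```python
import numpy as np

def refusal_direction(harmful, harmless):
    """Unit vector from the harmless mean to the harmful mean
    (difference of means; an assumed estimator, not the paper's)."""
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, direction):
    """Remove each hidden state's component along a unit `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

rng = np.random.default_rng(0)
dim = 16  # hypothetical hidden size for the demo

# Synthetic hidden states: harmful prompts are shifted along one axis.
true_dir = np.zeros(dim)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, dim))
harmful = rng.normal(size=(200, dim)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r)

# After ablation, no hidden state has a component left along r.
max_residual = float(np.abs(ablated @ r).max())
print("cosine with planted direction:", float(r @ true_dir))
print("max residual projection:", max_residual)
```

Because the projection removes exactly one direction, any refusal signal spread over several task-specific directions is left largely intact. This is consistent with the abstract's claim that a single global ablation cannot address over-refusal.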

Problem

Research questions and friction points this paper is trying to address.

over-refusal

aligned LLMs

representation subspaces

task-conditioned refusal

harmful requests

Innovation

Methods, ideas, or system contributions that make the work stand out.

over-refusal

representation subspaces

task-conditioned refusal