SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the “alignment tax”: the degradation of general capabilities that large language models suffer when they are aligned with human values. Existing approaches rely heavily on extensive general-domain data or auxiliary reward models, incurring substantial computational and data costs. To overcome this, the paper proposes SafeSteer, which formulates safety alignment as a sparse, local optimization problem within the model’s output distribution. SafeSteer constructs a safety-focused teacher model via activation steering, introduces an algorithm to identify safety-critical tokens, and applies a reverse KL penalty only to these selected tokens. SafeSteer achieves strong safety performance across seven benchmarks using only 100 harmful examples (less than 1% of the alignment data required by prior methods) and no general-domain data, while incurring minimal performance loss on five general capability benchmarks.

📝 Abstract

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, a cost termed the alignment tax. Existing methods mitigate this by balancing dual objectives, an approach that relies heavily on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. SafeSteer then restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples, less than 1% of the data used by previous baselines, and no general-purpose data, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.
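
The idea of restricting the reverse KL penalty to selected tokens can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not specify how safety tokens are selected, so the top-k rule below (pick the k positions where the student diverges most from the teacher) is purely an assumption, as are all function names. The logits are assumed to come from a student-sampled (on-policy) response scored by both the student and an activation-steered teacher.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary (last) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    # Reverse KL, i.e. KL(student || teacher), at each sequence position.
    # Both inputs have shape (seq_len, vocab_size).
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return (np.exp(log_p_s) * (log_p_s - log_p_t)).sum(axis=-1)

def select_safety_tokens(per_token_kl, k):
    # ASSUMPTION (hypothetical rule, not from the paper): treat the k
    # positions with the largest student-teacher divergence as the
    # "safety tokens". Requires 1 <= k <= seq_len.
    mask = np.zeros(per_token_kl.shape, dtype=bool)
    mask[np.argsort(per_token_kl)[-k:]] = True
    return mask

def localized_reverse_kl(student_logits, teacher_logits, k):
    # Average the reverse KL over the selected positions only, so all
    # other positions contribute nothing to the loss.
    per_token = reverse_kl_per_token(student_logits, teacher_logits)
    mask = select_safety_tokens(per_token, k)
    return per_token[mask].mean(), mask
```

Because unselected positions are excluded from the loss, they produce no gradient signal, which is the intuition behind leaving general capabilities largely untouched. In a real training loop the same masking would be applied to a differentiable loss in an autodiff framework.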

Problem

Research questions and friction points this paper is trying to address.

alignment tax

safety alignment

large language models

capability degradation

efficient alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

safety alignment

activation steering

localized modification

alignment tax
