Subliminal Learning Is Steering Vector Distillation

📅 2026-05-31

🤖 AI Summary
This work investigates how semantically unrelated data can lead a student model to implicitly acquire a teacher model's semantic traits, such as preferences set by a system prompt. We cast this phenomenon, subliminal learning, as a special case of “steering vector distillation” and show that it depends on both the specific model and the use of an adaptive optimizer. Through steering vector analysis, fine-tuning experiments, activation-gradient analysis, and optimizer comparisons, we show that both semantic and random steering vectors can be distilled into the student. Our findings also mark the limits of the mechanism: system prompts that are not well approximated by a steering vector are not subliminally learned, and the effect does not transfer between models.
📝 Abstract
Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.
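The two mechanisms described in the abstract can be illustrated with a small NumPy sketch. This is a toy linear analogue under stated assumptions, not the paper's experimental setup: the weights `W` and `U`, the function `forward`, and all dimensions are hypothetical, and a per-coordinate sign update stands in for an adaptive optimizer such as Adam (whose update divides each coordinate by a running gradient magnitude). Part 1 fits a single hidden-layer vector to a steered teacher's outputs; Part 2 compares how a weak but consistent gradient component fares against one outlier gradient under a plain sum versus a normalized sum.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 16

# Hypothetical frozen weights of a toy linear "model": logits = U @ (W @ x).
W = rng.normal(size=(d_hidden, d_in))
U = rng.normal(size=(d_out, d_hidden))


def forward(x, steer=None):
    """Return toy logits; `steer` is a vector added to the hidden activations."""
    h = W @ x
    if steer is not None:
        h = h + steer
    return U @ h


def cosine(a, c):
    """Cosine similarity between two vectors."""
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))


# Part 1: steering vector distillation in the toy model.
v = rng.normal(size=d_hidden)       # the teacher's steering vector
X = rng.normal(size=(32, d_in))     # inputs that carry no information about v
teacher_shift = np.mean([forward(x, steer=v) - forward(x) for x in X], axis=0)

# The student fits one hidden-layer vector b so that its outputs match the
# steered teacher's. In this linear toy that reduces to least squares.
b, *_ = np.linalg.lstsq(U, teacher_shift, rcond=None)
distilled_alignment = cosine(b, v)

# Part 2: a small consistent gradient component versus one outlier gradient.
direction = np.array([1.0, 0.0])                # steering direction (2-D toy)
grads = [0.01 * direction for _ in range(99)]   # weak but consistent signal
grads.append(np.array([0.0, 10.0]))             # a single outlier gradient

# Accumulated gradient; the parameter update is a negative multiple of it.
plain_accum = np.sum(grads, axis=0)                        # SGD-like
sign_accum = np.sum([np.sign(g) for g in grads], axis=0)   # normalized, Adam-like

plain_alignment = cosine(plain_accum, direction)
sign_alignment = cosine(sign_accum, direction)

print(f"distilled vector vs. teacher vector: {distilled_alignment:.3f}")
print(f"plain sum vs. steering direction:    {plain_alignment:.3f}")
print(f"sign sum vs. steering direction:     {sign_alignment:.3f}")
```

In this toy, the fitted vector `b` aligns almost perfectly with `v` because `U` has full column rank, so the output shift pins down the hidden-layer vector. The plain sum is dominated by the single outlier (alignment of roughly 0.1), while the normalized sum stays close to the steering direction. Real language models are nonlinear, so this only shows the shape of the argument.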
Problem

Research questions and friction points this paper is trying to address.

subliminal learning
steering vector
language models
system prompt
model distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

subliminal learning
steering vector
steering vector distillation
activation steering
adaptive optimization