Subliminal Learning is a LoRA Artifact

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

This study investigates the phenomenon of “subconscious learning,” wherein language models appear to transfer behavioral traits—such as a preference for cats—through innocuous data. Through comparative experiments between LoRA and full fine-tuning, complemented by context ablation and behavioral localization analyses, the work demonstrates that this effect is not a genuine mechanism of behavior transmission but rather an unstable artifact induced by specific LoRA hyperparameters and shared contextual elements (e.g., system prompts). The strength of the observed phenomenon exhibits an inverted U-shaped relationship with LoRA rank and vanishes entirely under full fine-tuning. These findings indicate that subconscious learning is an unreliable and configuration-dependent artifact rather than a robust or inherent property of model training.

📝 Abstract

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.

Problem

Research questions and friction points this paper is trying to address.

subliminal learning

behavioral transmission

language models

LoRA

finetuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

subliminal learning

LoRA artifact

behavioral transmission