When Context Returns: Toward Robust Internalization in On-Policy Distillation

📅 2026-06-09

🤖 AI Summary

In on-policy distillation, student models that have internalized privileged context during training often perform worse when that context is reintroduced, a phenomenon termed “context-induced degradation”. To address this, the work introduces the notion of “context removability” and proposes a lightweight consistency regularizer. The regularizer anchors the student's context-free output with a stop-gradient operation and penalizes the context-conditioned output for deviating from it via forward KL divergence, at the cost of one extra forward pass per training step. Across 12 configurations spanning diverse domains and model families, the method reduces context-induced harm in 11, improves context-conditioned accuracy in the majority of settings, and curbs response-length inflation.

📝 Abstract

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
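The regularizer described in the abstract can be illustrated with a minimal sketch for a single token position. This is not the authors' implementation. It assumes that "forward KL" means KL(p_anchor || q_ctx), with the no-context distribution as the fixed target, and it imitates stop-gradient by returning a gradient only for the context-conditioned logits. The function names `log_softmax` and `consistency_loss` are hypothetical, and how the per-token terms are aggregated or weighted against the distillation loss is not specified in the abstract.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def consistency_loss(no_ctx_logits, ctx_logits):
    """Forward KL(p_anchor || q_ctx) for one token position (assumed direction).

    Returns (loss, grad_wrt_ctx_logits). The anchor p comes from the
    no-context pass and is treated as a constant, which is the effect a
    stop-gradient has in an autodiff framework: no gradient is produced
    for no_ctx_logits.
    """
    log_p = log_softmax(no_ctx_logits)  # anchor: student without context
    log_q = log_softmax(ctx_logits)     # student with context reintroduced
    p = [math.exp(lp) for lp in log_p]
    q = [math.exp(lq) for lq in log_q]
    loss = sum(pi * (lp - lq) for pi, lp, lq in zip(p, log_p, log_q))
    # With p held fixed, d/dz KL(p || softmax(z)) = softmax(z) - p.
    grad = [qi - pi for qi, pi in zip(q, p)]
    return loss, grad


if __name__ == "__main__":
    anchor = [2.0, 0.5, -1.0]            # illustrative logits, not real data
    same, _ = consistency_loss(anchor, anchor)
    drift, grad = consistency_loss(anchor, [0.5, 2.0, -1.0])
    print(same, drift, grad)
```

When both passes agree the penalty is zero, and when the context shifts probability away from the anchor's preferred token the gradient pushes the context-conditioned logits back toward the anchor while leaving the no-context output untouched.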

Problem

Research questions and friction points this paper is trying to address.

context-induced degradation

on-policy distillation

context removability

privileged context

robust internalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

context removability

context-induced degradation