Constitutional On-Policy Safe Distillation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of expressiveness that arises when constitution-based safety alignment is performed through on-policy self-distillation (OPSD): conditioning the teacher on a constitution contracts its distribution toward short, overly conservative responses, and Reverse KL optimization amplifies that contraction. The study formalizes this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. It then proposes Constitutional On-Policy Safe Distillation (COPSD), which calibrates the teacher with a Cross-SFT cold start before performing constitution-conditioned on-policy distillation. Across twelve benchmarks, COPSD achieves a consistently stronger safety-helpfulness trade-off than the baselines while substantially reducing the safety tax on general reasoning ability.

📝 Abstract

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
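
To make the Reverse KL claim in the abstract concrete, the sketch below computes a token-level reverse KL, KL(student || teacher), on a toy four-token vocabulary. It is an illustration only, not the paper's implementation: the abstract does not give the exact loss, and the function names, logits, and vocabulary size are assumptions chosen for the example. It compares a teacher whose distribution stays close to the student's with one that has contracted onto a single token, as a constitution-conditioned teacher might.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reverse_kl_per_token(student_logits, teacher_logits):
    """Mean over positions of KL(student || teacher).

    Both arguments are lists of per-position logit vectors. In on-policy
    distillation the positions would come from a sequence sampled by the
    student; here they are hand-written toy values.
    """
    losses = [
        kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(losses) / len(losses)


# Hypothetical single-position example over a 4-token vocabulary.
student = [[1.0, 0.8, 0.6, 0.4]]             # fairly broad student
calibrated_teacher = [[1.2, 0.9, 0.5, 0.3]]  # teacher close to the student
contracted_teacher = [[6.0, 0.0, 0.0, 0.0]]  # teacher collapsed onto one token

loss_calibrated = reverse_kl_per_token(student, calibrated_teacher)
loss_contracted = reverse_kl_per_token(student, contracted_teacher)

print(f"reverse KL vs calibrated teacher: {loss_calibrated:.4f}")
print(f"reverse KL vs contracted teacher: {loss_contracted:.4f}")
```

Because reverse KL weights the log-ratio by the student's own probabilities, any student mass on tokens the teacher considers unlikely is penalized heavily. Against the contracted teacher the loss is much larger, so minimizing it pushes the student to drop its alternatives and concentrate on the teacher's single mode, which is consistent with the reduced expressiveness the abstract describes.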

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

safety alignment

constitutional AI

distribution collapse

expressiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constitutional AI

On-Policy Distillation

Geometric Leakage