Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the safety degradation that chain-of-thought (CoT) distillation induces in small language models (SLMs). The authors propose a harm-mitigating distillation framework that requires no additional annotations or computational overhead, built on two core techniques: (1) Slow Tuning, which constrains the magnitude of weight updates during distillation to suppress harmful knowledge transfer; and (2) Low-Entropy Masking, which dynamically masks low-entropy tokens in the CoT sequences to impede learning of latent harmful patterns. To the authors' knowledge, this is the first work to systematically identify and mitigate the implicit safety threats that CoT distillation poses to SLMs. Experiments on benchmarks including BBH and ARC demonstrate that the approach significantly enhances reasoning capability while simultaneously improving safety metrics, such as robustness against adversarial attacks and a reduced harmful-output rate, outperforming existing distillation methods and jointly optimizing safety and reasoning performance.

📝 Abstract
Previous chain-of-thought (CoT) distillation methods have primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects of this training on SLM safety, which this study reveals. Although there are safety-alignment works that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data and may impair the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes so that the weights are optimized within a neighborhood of the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and a safety evaluation (AdvBench) show that SLowED retains the safety of SLMs while improving their reasoning capability comparably to existing distillation methods. Furthermore, an ablation study demonstrates the effectiveness of Slow Tuning and Low-Entropy Masking: the former maintains the model's safety in the early stage of training, and the latter prolongs the number of safe training epochs.
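The abstract describes Slow Tuning as keeping the optimized weights in a neighborhood of the initial weight distribution. A minimal sketch of one way to realize that idea is shown below, using an ordinary gradient step followed by a projection back into an L2 ball around the initial weights. The function name, the projection rule, and the `radius` hyperparameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def slow_tuning_step(w_init, w, grad, lr=0.1, radius=1.0):
    """One hypothetical Slow Tuning step (a sketch, not the paper's
    exact rule): apply a gradient step, then project the weights back
    into an L2 ball of the given radius around the initial weights,
    so training stays near the initial weight distribution."""
    w_new = w - lr * grad              # ordinary SGD step
    delta = w_new - w_init             # displacement from the initial weights
    norm = np.linalg.norm(delta)
    if norm > radius:                  # drifted too far: scale back onto the ball
        w_new = w_init + delta * (radius / norm)
    return w_new
```

A small step that stays inside the ball is left unchanged, while a large step is shrunk so its distance from the initial weights equals `radius`; this is the sense in which weight changes are "scaled down."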
Problem

Research questions and friction points this paper is trying to address.

Maintain SLM safety during CoT distillation
Avoid extra computation or annotated data
Balance safety and reasoning capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow Tuning minimizes weight changes for safety
Low-Entropy Masking excludes unnecessary token learning
SLowED combines both for safe CoT distillation
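The Low-Entropy Masking idea above can be sketched as follows: compute the Shannon entropy of each token's output distribution and exclude near-deterministic (low-entropy) tokens from the fine-tuning loss. The function name, the entropy formula over per-token probability distributions, and the `threshold` hyperparameter are assumptions for illustration; the paper's exact masking criterion may differ.

```python
import numpy as np

def low_entropy_mask(token_probs, threshold=0.5):
    """Hypothetical Low-Entropy Masking: given an array of per-token
    probability distributions over the vocabulary (shape: tokens x vocab),
    return a boolean mask that keeps only tokens whose distribution
    entropy is at least `threshold` as learning targets."""
    eps = 1e-12                        # avoid log(0)
    # Shannon entropy (in nats) of each token's distribution
    entropy = -np.sum(token_probs * np.log(token_probs + eps), axis=-1)
    return entropy >= threshold        # True = keep token in the loss
```

A uniform distribution over the vocabulary has high entropy and is kept, while a near-one-hot distribution (the model is already confident, so the token is an unnecessary learning target) falls below the threshold and is masked out.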