Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high training cost and verbose student outputs caused by long chain-of-thought (CoT) reasoning traces in knowledge distillation. The authors study post-hoc compression that shortens these traces before distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate correct reasoning traces, which instruction-tuned models then compress to 8.6–21.0% of their original character length. Compression reduces training tokens to 12–30% of raw, accelerates training by 2.0–7.6×, and shortens inference outputs by 3–19×. Students trained on compressed traces retain up to 96% of raw-trace accuracy, but raw traces still yield the highest downstream accuracy at every scale and for both teachers, so the method offers an accuracy-efficiency trade-off rather than a free improvement. At the 0.8B scale under LoRA, compressed traces narrow the gap to raw traces without exceeding them.

📝 Abstract

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x, with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA, compressed traces narrow the raw-vs-compressed gap but do not exceed raw.
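
The character-length ratio and the length-matched truncation baseline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how truncation is performed, so this assumes the baseline keeps the leading characters of the raw trace up to the compressed trace's length; the function names `compression_ratio` and `truncate_to_match` are hypothetical and are not taken from the paper.

```python
def compression_ratio(raw: str, compressed: str) -> float:
    """Return the compressed trace length as a fraction of the raw character length."""
    if not raw:
        raise ValueError("raw trace must be non-empty")
    return len(compressed) / len(raw)


def truncate_to_match(raw: str, compressed: str) -> str:
    """Naive length-matched baseline (assumption: keep the prefix of the raw trace).

    The result has the same character budget as the compressed trace, so any
    difference between the two comes from content rather than from length.
    """
    return raw[: len(compressed)]


if __name__ == "__main__":
    raw_trace = "x" * 200          # stand-in for a long raw reasoning trace
    compressed_trace = "y" * 30    # stand-in for its model-compressed version

    ratio = compression_ratio(raw_trace, compressed_trace)
    baseline = truncate_to_match(raw_trace, compressed_trace)
    print(f"compressed/raw = {ratio:.1%}, baseline length = {len(baseline)}")
```

With these toy inputs the ratio is 15%, which falls inside the 8.6-21.0% range reported for the compressed traces. The numbers are illustrative only.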

Problem

Research questions and friction points this paper is trying to address.

reasoning trace compression

knowledge distillation

chain-of-thought

efficiency-accuracy trade-off

model compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning trace compression

knowledge distillation

chain-of-thought