AI Summary
This work addresses the high training cost and output verbosity that long chain-of-thought (CoT) reasoning traces introduce into knowledge distillation. The authors study post-hoc compression that shortens these traces before distillation while students retain up to 96% of the accuracy obtained with raw traces. Instruction-tuned models compress the correct reasoning traces generated by two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, and the distillation experiments include parameter-efficient fine-tuning with LoRA. The compressed traces occupy only 8.6–21.0% of the original character length, reducing training tokens to 12–30% of raw and accelerating training by 2.0–7.6×. Inference outputs are shortened by 3–19×. Raw traces still yield the highest accuracy at every scale, so compression offers an accuracy-efficiency trade-off rather than a free improvement.
Abstract
Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x, with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA, compressed traces narrow the raw-vs-compressed gap but do not exceed raw.
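The quantities reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the truncation baseline assumes per-example matching on character length, and "per-token efficiency" is assumed here to mean accuracy per output token relative to the raw-trace student. The numbers in the usage lines are toy values, not results from the paper.

```python
def compression_ratio(raw: str, compressed: str) -> float:
    """Compressed length as a fraction of the raw trace's character length."""
    if not raw:
        raise ValueError("raw trace must be non-empty")
    return len(compressed) / len(raw)


def length_matched_truncation(raw: str, compressed: str) -> str:
    """Naive baseline (assumed form): keep the first len(compressed) characters of raw."""
    return raw[: len(compressed)]


def relative_per_token_efficiency(
    acc: float, tokens: float, raw_acc: float, raw_tokens: float
) -> float:
    """Assumed definition: (acc / tokens) divided by (raw_acc / raw_tokens)."""
    if tokens <= 0 or raw_tokens <= 0 or raw_acc <= 0:
        raise ValueError("token counts and raw accuracy must be positive")
    return (acc / tokens) / (raw_acc / raw_tokens)


# Toy inputs, chosen only to exercise the functions.
raw_trace = "step " * 100          # 500 characters
compressed_trace = "x" * 50        # 50 characters

ratio = compression_ratio(raw_trace, compressed_trace)
truncated = length_matched_truncation(raw_trace, compressed_trace)
efficiency = relative_per_token_efficiency(
    acc=0.48, tokens=100, raw_acc=0.50, raw_tokens=1000
)
```

Under this assumed definition, a student that keeps most of the raw-trace accuracy while emitting far fewer tokens scores well above 1, which is the sense in which the abstract describes a trade-off rather than a free improvement.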