Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underutilization of negative reasoning trajectories in offline knowledge distillation for large language models (LLMs), where standard practice keeps only correct teacher traces. We propose Reinforcement Distillation (REDI), a novel objective that requires neither a reference model nor explicit reward modeling. Methodologically, we introduce a two-stage framework: (1) supervised fine-tuning on teacher-generated correct reasoning traces; and (2) joint optimization over both positive and negative trajectories via the REDI objective, explicitly modeling relative trace quality. This departs from conventional paradigms that rely solely on positive samples or on preference-based objectives such as DPO and SimPO. Evaluated on MATH-500, Qwen-REDI-1.5B achieves 83.1% pass@1, matching or surpassing DeepSeek-R1-Distill-Qwen-1.5B (post-trained on 800K proprietary examples) while using only 131K openly available examples, establishing new state-of-the-art performance for offline distillation in the 1.5B-parameter regime.
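
A minimal sketch of the two-stage recipe described above, in PyTorch-style Python. The paper's exact REDI objective is not reproduced here; the Stage 2 loss below (raising the likelihood of positive traces while pushing down negative ones, with no reference model or reward model) and the names redi_stage2_loss and alpha are illustrative assumptions, not the authors' formulation.

import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    # Mean per-token log-likelihood of a reasoning trace under the student model.
    logits = model(input_ids).logits[:, :-1, :]   # next-token predictions
    targets = labels[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != -100).float()              # ignore prompt/padding positions
    return (token_logp * mask).sum(-1) / mask.sum(-1)

# Stage 1: standard SFT on teacher-generated correct traces only.
def stage1_sft_loss(model, pos_ids, pos_labels):
    return -sequence_logprob(model, pos_ids, pos_labels).mean()

# Stage 2: joint optimization over positive and negative traces, with neither a
# reference model nor a reward model. The down-weighting of the negative term
# (alpha) is an assumption for illustration only.
def redi_stage2_loss(model, pos_ids, pos_labels, neg_ids, neg_labels, alpha=0.5):
    logp_pos = sequence_logprob(model, pos_ids, pos_labels)
    logp_neg = sequence_logprob(model, neg_ids, neg_labels)
    return -(logp_pos - alpha * logp_neg).mean()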

📝 Abstract
Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practice employs rejection sampling, discarding incorrect reasoning examples -- valuable yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baselines such as Rejection Sampling SFT and SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary examples) across various mathematical reasoning benchmarks, establishing a new state of the art for 1.5B models post-trained offline with openly available data.
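
The MATH-500 number above is pass@1: each problem gets a single sampled solution, and the score is the fraction answered correctly. A small sketch of that metric, assuming hypothetical generate_solution and is_correct helpers (e.g., exact match on the boxed final answer):

def pass_at_1(problems, generate_solution, is_correct):
    # One sampled solution per problem; score is the fraction judged correct.
    correct = 0
    for problem in problems:
        solution = generate_solution(problem["question"])
        correct += int(is_correct(solution, problem["answer"]))
    return correct / len(problems)

At 83.1% on the 500 MATH-500 problems, this corresponds to roughly 415 to 416 correct first attempts (the fractional percentage suggests averaging over multiple evaluation runs or sampling seeds).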
Problem

Research questions and friction points this paper is trying to address.

Leveraging both positive and negative reasoning traces for LLM distillation
Improving offline LLM reasoning performance via Reinforcement Distillation
Maximizing small model capabilities using openly available teacher data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses positive and negative reasoning traces
Two-stage Reinforcement Distillation framework
Reference-free loss function that outperforms DPO and SimPO (the baseline objectives are contrasted in the sketch below)
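
For context on the baselines named above, the standard DPO and SimPO preference losses can be sketched as follows. These are the established formulations REDI is compared against, not the REDI objective itself; sequence log-probabilities are assumed to be precomputed tensors.

import torch.nn.functional as F

# DPO requires a frozen reference policy: the margin is the difference of
# policy-vs-reference log-ratios for the chosen (positive) and rejected
# (negative) traces.
def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()

# SimPO is reference-free: it uses length-normalized (per-token average)
# log-likelihoods and a fixed target margin gamma.
def simpo_loss(avg_logp_pos, avg_logp_neg, beta=2.0, gamma=0.5):
    return -F.logsigmoid(beta * (avg_logp_pos - avg_logp_neg) - gamma).mean()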
Shuyao Xu
National University of Singapore, INFLY TECH (Shanghai) Co., Ltd.
Cheng Peng
INFLY TECH (Shanghai) Co., Ltd.
Jiangxuan Long
INFLY TECH (Shanghai) Co., Ltd.
Weidi Xu
Infly Technology
Wei Chu
INFLY TECH (Shanghai) Co., Ltd.
Yuan Qi
INFLY TECH (Shanghai) Co., Ltd., AI3 Institute of Fudan University