🤖 AI Summary
This work addresses the limitation of black-box knowledge distillation for cross-architecture large language models (LLMs), where conventional methods fail to capture teacher-student discrepancies in reasoning processes. We propose a preference-optimization-based hybrid distillation framework. Instead of explicitly mimicking chain-of-thought reasoning, our method constructs trajectory pairs from diverse reasoning paths of both teacher and student models, and employs Odds-Ratio Preference Optimization (ORPO) to model relative path quality via odds ratios, enabling fine-grained, architecture-agnostic knowledge transfer. Crucially, the approach requires no access to teacher internal states or assumptions about decoding strategies. Extensive experiments across five benchmark datasets and heterogeneous student architectures, including Phi-3, Qwen2, and Llama3, demonstrate consistent and significant improvements over standard black-box, on-policy, and off-policy distillation baselines, validating its generalizability and robustness.
📝 Abstract
We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.
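To make the odds-ratio contrast concrete, here is a minimal sketch of the standard ORPO preference term, which scores a preferred (e.g. teacher-derived) trace against a dispreferred (e.g. student-derived) trace. The function names and the use of length-normalized sequence log-probabilities are illustrative assumptions, not the paper's actual implementation.

```python
import math

def odds(logp: float) -> float:
    # Convert a (length-normalized) sequence log-probability into odds:
    # odds(y | x) = P(y | x) / (1 - P(y | x)).
    p = math.exp(logp)
    return p / (1.0 - p)

def orpo_preference_loss(logp_chosen: float, logp_rejected: float) -> float:
    # ORPO's odds-ratio term: -log sigmoid(log(odds(chosen) / odds(rejected))).
    # Lower loss when the chosen trace is more likely than the rejected one.
    log_odds_ratio = math.log(odds(logp_chosen)) - math.log(odds(logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
```

In full ORPO training this term is added to the ordinary negative log-likelihood on the preferred trace, so the student both imitates good trajectories and pushes down the odds of poor ones.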