🤖 AI Summary
This work addresses the limitation of black-box knowledge distillation for cross-architecture large language models (LLMs), where conventional methods fail to capture teacher-student discrepancies in reasoning processes. We propose a preference-optimization-based hybrid distillation framework. Instead of explicitly mimicking chain-of-thought reasoning, our method constructs trajectory pairs from diverse reasoning paths of both teacher and student models, and employs Odds-Ratio Preference Optimization (ORPO) to model relative path quality via odds ratios, enabling fine-grained, architecture-agnostic knowledge transfer. Crucially, the approach requires no access to teacher internal states or assumptions about decoding strategies. Extensive experiments across five benchmark datasets and heterogeneous student architectures, including Phi-3, Qwen2, and Llama3, demonstrate consistent and significant improvements over standard black-box, on-policy, and off-policy distillation baselines, validating its generalizability and robustness.
📝 Abstract
We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.
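To make the odds-ratio contrast concrete, here is a minimal sketch of the standard ORPO preference term, which scores a preferred (e.g. teacher-derived) trace against a dispreferred (e.g. student-derived) trace. The function names and the use of length-normalized sequence log-probabilities are illustrative assumptions, not the paper's actual implementation.

```python
import math

def odds(logp: float) -> float:
    # Convert a (length-normalized) sequence log-probability into odds:
    # odds(y | x) = P(y | x) / (1 - P(y | x)).
    p = math.exp(logp)
    return p / (1.0 - p)

def orpo_preference_loss(logp_chosen: float, logp_rejected: float) -> float:
    # ORPO's odds-ratio term: -log sigmoid(log(odds(chosen) / odds(rejected))).
    # Lower loss when the chosen trace is more likely than the rejected one.
    log_odds_ratio = math.log(odds(logp_chosen)) - math.log(odds(logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
```

In full ORPO training this term is added to the ordinary negative log-likelihood on the preferred trace, so the student both imitates good trajectories and pushes down the odds of poor ones.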