Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

📅 2026-06-10

🤖 AI Summary

This work addresses limitations of existing rubric-based training methods, which rely on large language model verifiers and therefore incur high computational overhead, inherit verifier bias, and receive only sparse feedback. The authors propose Rubric-Guided Self-Distillation (RGSD), a verifier-free approach in which the base policy, conditioned on the rubric, acts as a teacher and provides dense, token-level supervision to the unconditioned student policy. The method needs only one on-policy rollout per prompt and replaces sparse trajectory-level rewards with per-token learning signals. Experiments in medical and science domains on Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models show rubric satisfaction comparable to judge-based GRPO without any verifier calls during training, although a stronger GRPO judge can still outperform RGSD in some settings.

📝 Abstract

Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
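
As a rough illustration of what "distilling token-by-token" can look like, the sketch below computes one loss value per token of a student rollout by comparing two next-token distributions at each position: the teacher's (the same model scored with the rubric in its prompt) and the student's (scored without it). The abstract does not state which divergence RGSD uses, so the forward KL here is an assumption, and the function names `log_softmax`, `token_kl`, and `per_token_distillation_losses` are hypothetical. Logits are plain Python lists so the example runs without any ML library.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) at a single token position.

    Assumption: forward KL is used only for illustration; the source does
    not specify the divergence.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def per_token_distillation_losses(
    teacher_seq: List[List[float]], student_seq: List[List[float]]
) -> List[float]:
    """One loss per generated token, instead of one reward per trajectory.

    teacher_seq[i]: logits at position i when the prompt includes the rubric.
    student_seq[i]: logits at position i when the prompt omits the rubric.
    Both are evaluated on the same student-generated rollout.
    """
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student must score the same rollout")
    return [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]


if __name__ == "__main__":
    # Toy rollout: 2 generated tokens, vocabulary of size 3.
    teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    losses = per_token_distillation_losses(teacher, student)
    print(losses)                      # one value per token position
    print(sum(losses) / len(losses))   # mean loss a trainer might minimize
```

In this toy rollout the second position contributes zero loss because the two distributions already agree, while the first position carries all of the signal. That per-position attribution is the contrast with a single end-of-trajectory verifier score that the abstract describes.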

Problem

Research questions and friction points this paper is trying to address.

rubric-based training

LLM verifier

training overhead

sparse reward

verifier bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-Guided Self-Distillation

verifier-free training

dense token-level distillation