Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

📅 2026-06-10
🤖 AI Summary
This work addresses the limitations of existing rubric-based training methods, which rely on large language model verifiers and suffer from high computational overhead, verifier bias, and sparse feedback. To overcome these issues, the authors propose a verifier-free self-distillation approach in which a rubric-conditioned teacher policy provides dense, token-level supervision to an unconditioned student policy, enabling rubric-guided training without any external verifier for the first time. The method needs only a single on-policy rollout per prompt to turn sparse trajectory-level rewards into fine-grained supervisory signals. Experiments in medical and scientific domains show that, when used to post-train Qwen-series models, this approach achieves rubric compliance comparable to verifier-based GRPO, without invoking any verifier during training.
πŸ“ Abstract
Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
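The token-by-token distillation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the exact objective, so the forward KL divergence KL(teacher || student), the mean over positions, and the function names `per_token_kl` and `distillation_loss` are all assumptions made for illustration. The teacher logits stand for the base policy scoring the student's rollout with the rubric in its context, and the student logits for the same rollout scored without the rubric.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each position of one rollout.

    Assumption: forward KL is used as the per-token signal; the source
    only says the teacher distribution is distilled token-by-token.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        losses.append(
            sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
        )
    return losses


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL over the rollout (hypothetical reduction)."""
    kl = per_token_kl(teacher_logits, student_logits)
    return sum(kl) / len(kl)


# Toy rollout: 2 positions over a 3-token vocabulary.
# Position 0: the rubric shifts the teacher toward token 0.
# Position 1: teacher and student agree, so no signal is produced.
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
kl = per_token_kl(teacher, student)
loss = distillation_loss(teacher, student)
```

In this toy case every position yields its own loss term, in contrast to a single scalar reward assigned at the end of the trajectory by a judge. In an actual training loop the teacher term would typically be treated as a fixed target with no gradient flowing through it, which is again an assumption rather than something the abstract specifies.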
Problem

Research questions and friction points this paper is trying to address.

rubric-based training
LLM verifier
training overhead
sparse reward
verifier bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-Guided Self-Distillation
verifier-free training
dense token-level distillation
post-training alignment
rubric-based feedback