🤖 AI Summary
This work addresses a limitation of rubric-based reinforcement learning on open-ended tasks: vague query structure yields vague rubrics, and reward signals become sparse or unreliable. The authors propose QUBRIC, a framework that co-designs queries and rubrics instead of treating the query distribution as fixed. QUBRIC uses key points derived from a teacher model to ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns gaps between teacher and policy responses into query-level criteria, and learnability filtering retains only informative query-rubric pairs, which are used for GRPO training. This design targets the failure mode in which naively narrowed queries contain fabricated references, every response fails, and training receives no reward signal. QUBRIC improves on the SFT baseline by 5.5 points on ArenaHard. Trained only on instruction-following data, it also transfers to three held-out benchmarks covering legal, moral, and narrative reasoning, with an average gain of 6.3 points concentrated in reasoning-related dimensions.
📝 Abstract
Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
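The remark that "all responses fail and training receives no reward signal" can be illustrated with the group-relative advantage used in GRPO, where each reward is standardized against the other responses sampled for the same query. The sketch below is a minimal illustration, not the authors' implementation: the function names, the `eps` constant, and the spread-based `is_learnable` check are assumptions introduced here, and the abstract does not specify how QUBRIC's learnability filter is actually computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style advantage).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_learnable(rewards, min_spread=0.0):
    """Hypothetical filter: keep a query-rubric pair only if the sampled
    responses receive different rewards, so the group carries a signal."""
    return max(rewards) - min(rewards) > min_spread


# Every response fails the rubric (e.g. the query cites an unverifiable reference).
all_fail = [0.0, 0.0, 0.0, 0.0]
# Responses differ in how well they satisfy the rubric.
mixed = [0.0, 1.0, 0.5, 1.0]

print(group_relative_advantages(all_fail))  # every advantage is zero
print(is_learnable(all_fail), is_learnable(mixed))
```

When every reward in a group is identical, each advantage is zero and the group contributes no gradient, which is why filtering out such query-rubric pairs before training is a natural design choice.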