Deep Research as Rubric for Reinforcement Learning

📅 2026-05-31

🤖 AI Summary
This work addresses the unreliability of automatic evaluation signals in open-ended reasoning and long-form text generation, where conventional scoring rubrics miss knowledge-intensive dimensions and so distort the reward. The authors propose DR-rubric, a two-stage framework: Stage I runs iterative multi-turn agentic search to uncover domain facts, structural constraints, and failure modes; Stage II distills that evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Treating rubric construction as a research process rather than a static template yields evidence-driven rubrics, and because the model under training can act as its own rubric generator, rubrics can be bootstrapped without frontier-model assistance. On six benchmarks, with only 1K to 3K training instances, the method achieves competitive performance: GPT-5-generated rubrics mainly improve breadth coverage on agentic tasks, Gemini-generated rubrics give the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics achieve the best overall performance at the third iteration.
📝 Abstract
Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts (either hand-crafted or prompt-generated) and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-rubric achieves competitive performance with only 1K to 3K training instances. GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution, achieving the best overall performance at the third iteration. These results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.
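To make the Stage II idea concrete, here is a minimal Python sketch of how atomic, independently verifiable constraints could be turned into a scalar reward and then into GRPO-style group-relative advantages. It is not the authors' implementation. The names `RubricItem`, `rubric_reward`, and `group_relative_advantages` are hypothetical, the weighted-fraction aggregation is an assumption (the abstract does not say how constraint outcomes are combined), and the keyword checks are toy stand-ins for whatever verifier or judge model the paper actually uses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class RubricItem:
    """One atomic constraint (hypothetical structure, for illustration only)."""
    description: str
    check: Callable[[str], bool]  # toy stand-in for a real verifier or judge
    weight: float = 1.0


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of satisfied constraints, in [0, 1] (assumed aggregation)."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rubric: each item is checked independently of the others.
rubric = [
    RubricItem("mentions the sample size", lambda s: "sample size" in s.lower()),
    RubricItem("cites at least one source", lambda s: "[1]" in s),
    RubricItem("states a limitation", lambda s: "limitation" in s.lower(), weight=2.0),
]

# A group of responses sampled for the same prompt.
group = [
    "The sample size was 40 [1]. One limitation is selection bias.",
    "The study was good.",
    "Sample size: 40 [1].",
]

rewards = [rubric_reward(r, rubric) for r in group]
advantages = group_relative_advantages(rewards)
```

Because each constraint is scored on its own, a response that satisfies only some items still receives partial credit, and the group normalization turns those graded rewards into a relative signal: responses above the group mean get a positive advantage and those below get a negative one.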
Problem

Research questions and friction points this paper is trying to address.

open-ended reasoning
reward signal
rubric construction
long-form generation
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric construction
reinforcement learning
agentic search
GRPO
bootstrap generation