Deep Research as Rubric for Reinforcement Learning

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of reliable automatic evaluation signals for open-ended reasoning and long-form generation, where conventional rubrics miss task-specific, knowledge-intensive dimensions and therefore distort the reward. The authors propose DR-rubric, a two-stage framework: Stage I uses iterative multi-turn agentic search to uncover domain facts, structural constraints, and failure modes; Stage II distills that evidence into atomic, independently verifiable constraints used for GRPO-based policy optimization. By framing rubric construction as a research process rather than a static template, the approach generates rubrics from evidence, and because the model under training can act as its own rubric generator, it can bootstrap rubrics without frontier-model assistance. Trained on only 1K–3K instances and evaluated on six benchmarks, the method achieves competitive performance: GPT-5-generated rubrics mainly improve breadth coverage on agentic tasks, Gemini-generated rubrics give the most balanced results across agentic and expert reasoning tasks, and bootstrap rubrics reach the best overall performance at the third iteration.

📝 Abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts, either hand-crafted or prompt-generated, and often miss the task-specific, knowledge-intensive dimensions that matter most, which distorts the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-rubric achieves competitive performance with only 1K–3K training instances: GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics follow a specialization-to-rebalancing trajectory, achieving the best overall performance at the third iteration. These results indicate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.
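
To make the Stage II idea concrete, the following is a minimal sketch of how a rubric of atomic, independently verifiable constraints could be turned into a scalar reward and then into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the names Constraint, rubric_reward, and group_relative_advantages are hypothetical, each constraint's check is a plain Python predicate standing in for whatever judge the paper actually uses, and the weighted-fraction aggregation is an assumed scoring rule. Only the advantage normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Constraint:
    """One atomic rubric item (hypothetical structure, for illustration)."""
    description: str
    # Assumption: a simple predicate stands in for an LLM or rule-based judge.
    check: Callable[[str], bool]
    weight: float = 1.0


def rubric_reward(response: str, rubric: List[Constraint]) -> float:
    """Weighted fraction of constraints the response satisfies, in [0, 1].

    Each constraint is checked independently of the others.
    The aggregation rule is an assumption, not taken from the paper.
    """
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    satisfied = sum(c.weight for c in rubric if c.check(response))
    return satisfied / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy rubric with two made-up constraints.
    rubric = [
        Constraint("names the optimizer", lambda r: "GRPO" in r),
        Constraint("includes a citation marker", lambda r: "[1]" in r),
    ]
    # A group of responses sampled for the same prompt.
    group = ["GRPO is used [1].", "GRPO is used.", "No details given."]
    rewards = [rubric_reward(r, rubric) for r in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because every constraint is scored on its own, a response that misses one item still earns credit for the rest, which is what gives the reward its fine-grained character; responses in a group that satisfy more constraints than the group average receive positive advantages.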

Problem

Research questions and friction points this paper is trying to address.

open-ended reasoning

reward signal

rubric construction

long-form generation

policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric construction

reinforcement learning

agentic search