ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

2026-06-02
Citations: 0
Influential: 0

AI Summary
This work addresses a limitation of existing large language model (LLM) search agents trained solely with outcome-based rewards: when all sampled trajectories in a group share the same correctness, the within-group advantage is zero, so the group contributes no gradient and the search process receives no supervision. The authors propose ARBOR, a reusable process-reward framework that maintains a buffer of general scoring rubrics shared across queries and updated during training. A small active subset of these rubrics scores search trajectories through sparse pairwise judging, and the resulting process-level scores are added to the base reward. The framework combines rubric induction from contrastive trajectories, consolidation of query-local drafts into cross-query rubrics, retirement of rubrics as the policy evolves, and policy-gradient optimization. On four multi-hop question answering benchmarks, it consistently outperforms GRPO and DAPO baselines, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise zero-gradient training groups into informative ones.
Abstract
LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
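The zero-gradient problem described in the abstract can be illustrated with a small sketch. It assumes a GRPO-style group-relative advantage of the form (reward - group mean) / (group std + eps). The process scores, the `weight` parameter, and both function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_rewards(outcome, process, weight=0.5):
    """Add a weighted process score to each outcome reward.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    return [o + weight * p for o, p in zip(outcome, process)]


# An outcome-homogeneous group: every sampled trajectory is correct.
outcome = [1.0, 1.0, 1.0, 1.0]
# Hypothetical process scores for the same four trajectories.
process = [0.9, 0.4, 0.6, 0.1]

flat = group_advantages(outcome)
shaped = group_advantages(shaped_rewards(outcome, process))

print("outcome-only advantages:", flat)
print("with process scores:    ", [round(a, 3) for a in shaped])
```

With outcome-only reward every advantage in the group is zero, so the group contributes nothing to the policy gradient. Once the process scores are added, trajectories in the same group are ranked against each other and the advantages become non-zero.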
Problem

Research questions and friction points this paper is trying to address.

process supervision
outcome-only reward
search agents
reward sparsity
rubric consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

process supervision
reusable rubric buffer
online reward
search agents
contrastive trajectories
Zheng Liu
Tsinghua University
Longxiang Zhang
Alibaba Group
Xintong Wang
Alibaba Group
Zhiang Xu
Alibaba Group
Shaoxiong Zhan
Tsinghua University
Natural Language Processing, Large Language Model
Xin Shan
Peking University
Wen Huang
Tsinghua University
Generative model
Tao Dai
Shenzhen University
image restoration, computer vision, deep learning
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory, machine learning, computer vision, AI security
Chengfu Huo
Alibaba Group
Liang Ding
Alibaba Group