AI Summary
Existing large language model (LLM) search agents are trained with outcome-based rewards alone, which provide no process-level supervision and yield no useful gradient when every sampled trajectory in a training group has the same outcome. To address this, the authors propose ARBOR, a reusable process-reward framework that maintains a shared, dynamically updated buffer of general scoring rubrics across queries and uses a small active subset of them to score search trajectories through sparse pairwise judging. These process-level scores are added to the base reward. By combining rubric induction from contrastive trajectories, rubric consolidation and retirement, and policy-gradient optimization, the approach aims to make process supervision efficient, consistent across queries, and reusable. On four multi-hop question answering benchmarks, it consistently outperforms GRPO and DAPO baselines, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise zero-gradient training groups into informative ones.
Abstract
LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
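The zero-gradient problem described in the abstract can be illustrated with a small sketch. It assumes the group-normalized advantage commonly used in GRPO-style training (reward minus group mean, divided by group standard deviation). The function names, the `weight` parameter, and the process scores are hypothetical illustrations, not the authors' implementation; the abstract states only that rubric scores are added to the base reward.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def add_process_scores(outcome_rewards, process_scores, weight=1.0):
    """Add a per-trajectory process score to the outcome reward.

    `weight` is a hypothetical knob for this sketch; the source does not
    specify how the two terms are scaled.
    """
    return [r + weight * s for r, s in zip(outcome_rewards, process_scores)]


# An outcome-homogeneous group: all four sampled trajectories are correct.
outcome = [1.0, 1.0, 1.0, 1.0]
flat = group_advantages(outcome)

# Made-up rubric scores for the same four trajectories (illustrative only).
process = [0.9, 0.2, 0.6, 0.3]
shaped = group_advantages(add_process_scores(outcome, process))

print(flat)    # every advantage is zero, so the group contributes no gradient
print(shaped)  # advantages now differ, ranking trajectories by process quality
```

With outcome reward alone, every advantage in the group is zero. Once the process scores are added, the trajectories are separated by how well they searched, even though all of them reached the correct answer.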