Trading Human Curation for Synthetic Augmentation in RLVR

📅 2026-06-02

🤖 AI Summary

This work addresses a bottleneck in Reinforcement Learning from Verifiable Rewards (RLVR) for agentic language models: high-quality training tasks depend on costly manual curation that does not scale. We study gate-filtered synthetic augmentation of a small hand-authored task base as a substitute for additional human-authored tasks. We formalize the cost-adjusted trade rate (ρ_cost) between augmented and human-authored tasks and measure it through a controlled ablation over training corpora with varying augmentation share. Substituting augmented content retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling, and ρ_cost stays between 1.4× and 11.6× across the plausible range of human-to-augmentation cost ratios.

📝 Abstract

The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $ρ_{\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate $ρ_{\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\times, 11.6\times]$ across the plausible $c_{\text{human}}/c_{\text{aug}}$ range.
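The abstract does not spell out how $ρ_{\text{cost}}$ is computed, so the following is only an illustrative sketch under a stated assumption: that the cost-adjusted rate is a raw substitution rate (how many human-authored tasks one augmented task replaces at equal held-out performance) scaled by the cost ratio $c_{\text{human}}/c_{\text{aug}}$. The function names, the `raw_rate` parameter, and the formula itself are hypothetical and may differ from the paper's definition.

```python
from typing import List, Tuple


def cost_adjusted_trade_rate(raw_rate: float, c_human: float, c_aug: float) -> float:
    """Assumed definition (not taken from the paper):
    rho_cost = raw_rate * (c_human / c_aug).

    raw_rate: hypothetical number of human-authored tasks replaced by one
              augmented task at equal held-out performance.
    c_human:  cost of producing one human-authored task.
    c_aug:    cost of producing one gate-filtered augmented task.
    """
    if raw_rate < 0:
        raise ValueError("raw_rate must be non-negative")
    if c_human <= 0 or c_aug <= 0:
        raise ValueError("costs must be positive")
    return raw_rate * (c_human / c_aug)


def sweep_cost_ratios(raw_rate: float, cost_ratios: List[float]) -> List[Tuple[float, float]]:
    """Evaluate the assumed rho_cost over a range of c_human / c_aug ratios."""
    return [(r, cost_adjusted_trade_rate(raw_rate, r, 1.0)) for r in cost_ratios]


if __name__ == "__main__":
    # Illustrative inputs only; these are not values reported in the paper.
    for ratio, rho in sweep_cost_ratios(raw_rate=0.5, cost_ratios=[2.0, 4.0, 10.0]):
        print(f"c_human/c_aug = {ratio:>5.1f}  ->  rho_cost = {rho:.2f}")
```

Under this assumed form, a value above 1 means a unit of spend on augmentation buys more training value than the same spend on human curation, which is one way to read a range that stays above 1× across the cost-ratio sweep.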

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning from verifiable rewards

task curation bottleneck

synthetic task augmentation

training data scalability

reward function authoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic augmentation

reinforcement learning from verifiable rewards

cost-adjusted trade rate