Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

📅 2026-01-30
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the scarcity of verifiable reward data for Reinforcement Learning with Verifiable Rewards (RLVR), which hinders the advancement of complex reasoning capabilities in large language models. To overcome this limitation, the authors propose a method that leverages unverifiable yet reasoning-rich textual sources, such as scientific textbooks, and automatically transforms them into multiple-choice cloze tasks by masking critical reasoning steps and generating diverse distractors with a large language model. This approach yields GooseReason-0.7M, a large-scale RLVR dataset of over 0.7 million tasks that circumvents the traditional reliance on human annotation or strictly verifiable data. Evaluated across 15 benchmarks, models trained on this data achieve state-of-the-art performance, setting new records for 1.5B and 4B parameter models and even surpassing a specialized 7B model on cybersecurity reasoning tasks.
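The transformation described above (mask a key reasoning step, mix it with LLM-generated distractors, emit a multiple-choice cloze task) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: in the actual method the masked span and the distractors are produced by prompting an LLM, whereas here they are passed in directly so the sketch stays self-contained. The function name `make_cloze_mcq` is hypothetical.

```python
import random

def make_cloze_mcq(passage, key_step, distractors, seed=0):
    """Turn a source passage into a multiple-choice cloze task.

    `key_step` is the reasoning span to mask (chosen by an LLM in the
    paper); `distractors` are plausible alternatives (LLM-generated in
    the paper). Returns the question, shuffled options, and the gold
    answer letter, which makes the task verifiable by exact match.
    """
    assert key_step in passage, "masked span must occur in the source text"
    # Mask the first occurrence of the key reasoning step.
    question = passage.replace(key_step, "____", 1)
    # Shuffle the correct step among the distractors (deterministic seed).
    options = distractors + [key_step]
    random.Random(seed).shuffle(options)
    answer = "ABCDEFGH"[options.index(key_step)]
    return {"question": question, "options": options, "answer": answer}
```

A single call on a textbook-style sentence yields a dictionary whose `answer` letter can be checked programmatically, which is exactly what makes the synthesized task usable as an RLVR reward signal.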

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
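What makes the synthesized tasks "verifiable" is that the multiple-choice format reduces reward computation to matching the model's final choice against the gold option letter. A minimal sketch of such a binary reward check is below; the answer-extraction regex is an illustrative heuristic and the function name `mcq_reward` is an assumption, not the paper's actual parser.

```python
import re

def mcq_reward(model_output, gold_letter):
    """Binary verifiable reward for a multiple-choice cloze task.

    Extracts the last standalone option letter (A-D) from the model's
    output and returns 1.0 on a match with the gold letter, else 0.0.
    """
    matches = re.findall(r"\b([A-D])\b", model_output.upper())
    if not matches:
        return 0.0
    return 1.0 if matches[-1] == gold_letter.upper() else 0.0
```

Because the check is a pure string comparison, it scales to millions of synthesized tasks with no human grading, which is the property that lets unverifiable source text feed an RLVR loop.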
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
data scarcity
unverifiable internet text
reasoning tasks
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLVR
data synthesis
unverifiable text
reasoning tasks
multiple-choice QA