SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

📅 2026-05-29

🤖 AI Summary
Existing self-play methods rely on rule-based validation or human annotations, which limits their applicability to open-ended tasks. This work proposes SCOPE, a data-free self-play framework that trains on open-ended tasks without external data. SCOPE co-evolves two policies: a challenger that generates tasks from documents, and a solver that answers them via multi-turn retrieval. A frozen copy of the initial model serves as the self-judge: it writes task-specific scoring rubrics from the source document and grades solver responses against them. Evaluated across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to 10.4 points on eight benchmarks and matches or exceeds GRPO_data, a baseline trained on roughly 9K curated prompts. It also yields gains of up to 13.8 points on seven held-out short-form QA benchmarks.
๐Ÿ“ Abstract
Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
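The data flow described in the abstract (Challenger writes a task from a document, Solver answers it, a frozen judge writes a rubric and grades the answer) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every name (`Episode`, `rubric_score`, `run_episode`) is hypothetical, the four components are passed in as plain callables, multi-turn retrieval is collapsed into a single `solver` call, and scoring an answer as the fraction of satisfied rubric items is an assumption, since the abstract does not state how grades are aggregated. Policy updates are omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: all names below are hypothetical; the abstract specifies no API.


@dataclass
class Episode:
    task: str
    answer: str
    rubric: List[str]
    score: float  # assumed: fraction of rubric items judged satisfied


def rubric_score(rubric: List[str], satisfied: List[bool]) -> float:
    """Fraction of rubric items satisfied (assumed aggregation rule)."""
    if len(rubric) != len(satisfied):
        raise ValueError("one verdict is required per rubric item")
    return sum(satisfied) / len(rubric) if rubric else 0.0


def run_episode(
    document: str,
    challenger: Callable[[str], str],               # document -> task
    solver: Callable[[str], str],                   # task -> answer
    write_rubric: Callable[[str, str], List[str]],  # frozen judge: (document, task) -> rubric
    grade: Callable[[List[str], str], List[bool]],  # frozen judge: (rubric, answer) -> verdicts
) -> Episode:
    """Run one self-play episode; no training step is performed here."""
    task = challenger(document)
    answer = solver(task)
    rubric = write_rubric(document, task)
    verdicts = grade(rubric, answer)
    return Episode(task, answer, rubric, rubric_score(rubric, verdicts))


if __name__ == "__main__":
    # Toy stand-ins for the three models, purely for illustration.
    doc = "The Nile flows north through Egypt into the Mediterranean."
    episode = run_episode(
        doc,
        challenger=lambda d: "Explain where the Nile flows.",
        solver=lambda t: "The Nile flows north through Egypt.",
        write_rubric=lambda d, t: ["mentions Egypt", "mentions Mediterranean"],
        grade=lambda r, a: [item.split()[-1] in a for item in r],
    )
    print(episode.score)  # prints 0.5: one of two rubric items is satisfied
```

In a real system the score would presumably feed the Solver's reward, and some function of it would shape the Challenger's reward so that tasks stay near the Solver's frontier. How SCOPE computes either reward is not given in the abstract.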
Problem

Research questions and friction points this paper is trying to address.

self-play
open-ended tasks
language models
external supervision
task generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-play
co-evolving policies
open-ended tasks
self-judging
retrieval-augmented generation