Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to constructing verifiable environments scale only linearly with construction effort, which limits the reasoning generalization that large language models gain from reinforcement learning. This work proposes RACES, a framework that, for the first time, enables automatic recursive composition of environments based on input-output type matching. Treating verifiable environments as composable building blocks, RACES applies four operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) to generate complex tasks automatically from a base set of 300 environments. On six benchmarks unseen during environment construction, the method improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and Qwen3-14B from 58.8 to 61.1. Training with only 50 base environments also reaches performance comparable to training on 300 individual environments.

📝 Abstract

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing methods build environments manually or one at a time, so the environment count grows only linearly with construction effort, which hinders scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be automatically fused into a new verifiable environment, and the fused environment can itself be composed again. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks that were unseen during the construction of the training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
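The type-matching idea in the abstract can be illustrated with a small Python sketch. This is not the RACES implementation: the `Env` class, the `sequential` function, and the three toy environments are hypothetical names chosen for illustration, and only a SEQUENTIAL-style operator is shown because the abstract does not specify how PARALLEL, SORT, and SELECT behave. The sketch assumes each environment carries an input type, an output type, and a reference solver used for verification.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task with a reference solver."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # reference solution used to check answers

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is correct when it equals the reference solution.
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's output type matches second's input type."""
    if first.out_type is not second.in_type:
        raise TypeError(
            f"cannot compose {first.name} -> {second.name}: "
            f"{first.out_type.__name__} does not match {second.in_type.__name__}"
        )
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        in_type=first.in_type,
        out_type=second.out_type,
        solve=lambda x: second.solve(first.solve(x)),
    )


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


# Three toy base environments (illustrative only).
count_vowels = Env("count_vowels", str, int, lambda s: sum(c in "aeiou" for c in s))
square = Env("square", int, int, lambda n: n * n)
is_prime = Env("is_prime", int, bool, _is_prime)

# The result of a composition is itself an Env, so it can be composed again.
composed = sequential(count_vowels, square)   # str -> int
nested = sequential(composed, is_prime)       # str -> bool

print(nested.name)
print(composed.solve("recursive"))            # 4 vowels, squared: 16
print(nested.verify("recursive", False))      # 16 is not prime: True
```

Because `sequential` returns an ordinary `Env`, composition is closed under itself, which is what makes recursive assembly possible in this sketch. A mismatched pair such as `sequential(is_prime, square)` raises `TypeError`, since `bool` is not the declared input type of `square`.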

Problem

Research questions and friction points this paper is trying to address.

verifiable environments

reasoning generalization

reinforcement learning

scalability

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

recursive composition

verifiable environments

reasoning generalization