RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for enhancing LLM reasoning rely on fixed or supervised verifiers, which are prone to reward hacking and generalize poorly. Method: a dual-agent reinforcement learning framework that jointly optimizes the generator and the verifier. The process-level generative verifier is trained solely on outcome-correctness rewards, so no manual process annotations are needed; it in turn supplies an unsupervised process reward to the generator, and the two are updated on an interleaved schedule within a PPO framework so that they co-improve. Contribution/Results: at the 7B/8B scale, the generator achieves state-of-the-art performance across five mathematical competition benchmarks and four cross-domain reasoning tasks, while the verifier attains leading results on ProcessBench, with particularly large gains on the hardest mathematical problems.
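The interleaved schedule described above can be sketched as a minimal training loop. This is an illustrative skeleton, not the paper's implementation: the function names (`generator_rollout`, `verifier_scores`, `outcome_correct`) are hypothetical stand-ins for what, in Tango, would be PPO rollouts and updates on the actual LLM generator and generative verifier.

```python
import random

random.seed(0)

# Hypothetical stand-ins (illustrative only); in Tango these would be
# LLM sampling and PPO updates rather than dummy values.
def generator_rollout(problem):
    # Generator produces a multi-step solution; here, a dummy step list.
    return [f"{problem}-step-{i}" for i in range(3)]

def verifier_scores(solution_steps):
    # Process-level verifier assigns a score per step (dummy values).
    return [random.random() for _ in solution_steps]

def outcome_correct(problem, solution_steps):
    # Rule-based final-answer check, used to reward the verifier.
    return random.random() > 0.5

def interleaved_training(problems, rounds=1):
    """One Tango-style outer loop: update the generator, then the verifier."""
    log = []
    for _ in range(rounds):
        # Generator phase: reward is the verifier's unsupervised process score.
        for p in problems:
            steps = generator_rollout(p)
            gen_reward = sum(verifier_scores(steps)) / len(steps)
            log.append(("generator", p, gen_reward))
        # Verifier phase: reward is agreement with outcome correctness only.
        for p in problems:
            steps = generator_rollout(p)
            verdict = sum(verifier_scores(steps)) / len(steps) > 0.5
            ver_reward = 1.0 if verdict == outcome_correct(p, steps) else 0.0
            log.append(("verifier", p, ver_reward))
    return log
```

The key design point the sketch mirrors is that neither phase needs per-step human labels: the generator's reward comes from the verifier, and the verifier's reward comes from final-answer correctness alone.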

📝 Abstract
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning via joint RL training of generator and verifier
Overcoming reward hacking and poor generalization in RL post-training
Developing a generative process-level verifier without process-level annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL trains generator and verifier together
Generative process-level verifier via RL
Outcome-level rewards without process annotations
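The last bullet, outcome-level rewards without process annotations, can be made concrete with a small sketch. This is an assumed simplification of the paper's reward: the verifier emits a True/False verdict per step, and its scalar reward depends only on whether its overall judgment matches final-answer correctness.

```python
def verifier_reward(step_verdicts, final_answer_correct):
    """Outcome-only reward for a process-level verifier (illustrative sketch).

    The verifier judges each step individually, but it is rewarded solely on
    whether its overall verdict agrees with final-answer correctness; no
    per-step (process) annotations are ever consulted.
    """
    # Assumed aggregation rule: a solution is valid iff every step passes.
    overall_verdict = all(step_verdicts)
    return 1.0 if overall_verdict == final_answer_correct else 0.0
```

For example, flagging a step as wrong in a solution whose final answer is indeed incorrect earns reward 1.0, while flagging a step in a solution whose final answer is correct earns 0.0, so the verifier learns calibrated step judgments from outcome signal alone.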