Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of reinforcement learning with verifiable rewards (RLVR): when all reasoning traces sampled for the same prompt receive identical rewards, group-relative advantage estimation yields no gradient signal, and the samples are effectively wasted even though they may differ in reasoning quality. To overcome this, the authors propose the Reasoning Arena framework, which builds fine-grained relative rewards from pairwise, tournament-style comparisons between traces. Each new trace is compared against a small, dynamically updated anchor pool, and a Bradley-Terry model fitted on the resulting incomplete comparison graph estimates relative rankings, converting zero-advantage samples into useful gradient updates. Reported results show an average 7.6% gain over the RLVR baseline on competition mathematics and coding benchmarks, 27%–41% faster training, and a nearly 50% reduction in generation compute.
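The zero-advantage problem described above can be illustrated with a minimal sketch. It assumes a mean-centred, standard-deviation-normalised group-relative advantage (a common formulation, not confirmed by the source as the exact estimator used); the function name and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Assumed estimator: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every trace gets the same verifiable reward yields all zeros,
# so no trace in the group contributes a gradient signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this assumption, every entry of `uniform` is exactly zero regardless of how the four traces differ in reasoning quality, which is the case Reasoning Arena routes to a judge instead of discarding.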

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces for a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces, which serve as anchors for establishing a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
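As a rough illustration of fitting a Bradley-Terry model on an incomplete comparison graph, the sketch below uses the standard minorisation-maximisation (MM) update. It is not the authors' implementation: the function name, the `prior` smoothing (virtual wins and losses against a dummy opponent of fixed strength 1), the iteration count, and the zero-mean centring of scores are all assumptions made for the example.

```python
import math


def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit log-strengths from (winner, loser) index pairs.

    Not every pair needs to appear in `comparisons`. Each item also gets
    `prior` virtual wins and `prior` virtual losses against a dummy opponent
    of fixed strength 1.0 (an assumed regulariser that keeps strengths finite).
    """
    wins = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = prior + sum(wins[i])
            # Dummy opponent: 2 * prior comparisons at strength 1.0.
            denom = 2.0 * prior / (p[i] + 1.0)
            for j in range(n_items):
                n_ij = wins[i][j] + wins[j][i]
                if j != i and n_ij > 0:
                    denom += n_ij / (p[i] + p[j])
            new_p.append(total_wins / denom)
        p = new_p

    scores = [math.log(x) for x in p]
    centre = sum(scores) / n_items
    return [s - centre for s in scores]  # zero-mean relative scores


# Trace 0 beat trace 1 and trace 1 beat trace 2; traces 0 and 2 never met.
scores = fit_bradley_terry(3, [(0, 1), (1, 2)])
```

Here traces 0 and 2 are never compared directly, yet the fitted scores still order them through their shared opponent, which is what lets anchor-based comparisons stand in for an exhaustive quadratic tournament.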

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

verifiable rewards

reasoning traces

reward sparsity

relative preference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning Arena

trace tournaments

relative reward signals