Diagnosing and Mitigating System Bias in Self-Rewarding RL

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In reinforcement learning with intrinsic rewards (RLIR), where the policy model rewards its own rollouts, systematic overestimation of high-confidence rollouts induces cumulative bias and training instability. This work identifies and quantifies three intrinsic biases: noise-induced bias, self-bias, and symmetric bias. To address them, the authors propose RLER, a novel RLIR framework featuring (i) a multi-model ensemble to mitigate individual-model biases, (ii) dynamic reward interpolation to balance reward-estimation confidence, and (iii) selective rollout to suppress erroneous high-confidence generations. Evaluated on unlabeled data, RLER achieves a 13.6% absolute performance gain over baseline RLIR and narrows the gap with reinforcement learning with verifiable rewards (RLVR) to only 3.6%. Moreover, RLER significantly improves training stability and scalability in continual-learning settings, enabling robust adaptation without human-annotated rewards.

πŸ“ Abstract
Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
Problem

Research questions and friction points this paper is trying to address.

Self-rewarding RL suffers from system bias in reward estimation
Bias causes unstable training and a performance gap with verifiable-reward RL (RLVR)
Proposed RLER method mitigates bias to enable stable unlabeled scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensembled rewards aggregate diverse models for bias mitigation
Adaptive reward interpolation and rollout selection stabilize training
Proposed RLER improves self-rewarding RL performance on unlabeled data
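The three innovations above can be illustrated with a minimal sketch. Everything here is a hypothetical rendering of the idea, not the paper's implementation: the function names, the interpolation weight `alpha`, and the agreement threshold `tau` are all assumptions. The sketch fuses a rollout's self-assigned reward with the mean score from an ensemble of peer models (reward interpolation over ensembled rewards), then discards rollouts whose self-estimate disagrees sharply with the fused estimate (selective rollout against overconfident errors).

```python
# Illustrative sketch of ensembled rewards + selective rollout in the
# spirit of RLER. alpha, tau, and all scores are invented for the example.
from statistics import mean


def ensembled_reward(self_score, peer_scores, alpha=0.5):
    """Interpolate the policy's own reward estimate with the ensemble
    mean, damping any single model's bias."""
    return alpha * self_score + (1 - alpha) * mean(peer_scores)


def select_rollouts(rollouts, tau=0.3):
    """Keep rollouts whose self-estimate agrees with the fused estimate
    within tau, suppressing erroneous high-confidence generations."""
    kept = []
    for r in rollouts:
        fused = ensembled_reward(r["self"], r["peers"])
        if abs(r["self"] - fused) <= tau:
            kept.append((r["text"], fused))
    return kept


rollouts = [
    # Self-estimate and peers agree: this rollout survives selection.
    {"text": "rollout A", "self": 0.9, "peers": [0.8, 0.85, 0.9]},
    # Overconfident self-estimate that the ensemble rejects: filtered out.
    {"text": "rollout B", "self": 0.95, "peers": [0.1, 0.2, 0.15]},
]
print(select_rollouts(rollouts))
```

The design point the sketch tries to capture is that the ensemble acts as a check on the policy's own confidence: a rollout is only trusted when independent scorers roughly corroborate the self-assigned reward.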