Can Large Reasoning Models Self-Train?

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scalability bottleneck in large reasoning models caused by reliance on human-annotated validators. We propose a fully self-supervised online reinforcement learning framework that requires neither ground-truth labels nor hand-crafted validators. Methodologically, it leverages self-consistency to generate proxy reward signals and integrates rejection sampling, policy optimization, and reward modeling for autonomous training on mathematical reasoning tasks. Our key contributions are: (i) the first RL paradigm for reasoning that eliminates human validators entirely; and (ii) a mechanistic analysis showing that proxy rewards are initially effective but prone to reward hacking, with a quantitative characterization of how their bias evolves over training. Experiments demonstrate that our method rapidly approaches supervised RL performance without any human annotations, achieving substantial gains in reasoning accuracy. This advances a viable pathway toward self-evolving large language models.

📝 Abstract
Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers an alternative, but its scalability is limited by its dependence on human-designed verifiers. Self-training, where the model's own judgment provides the supervisory signal, presents a compelling direction. We propose an online self-training reinforcement learning algorithm that leverages the model's self-consistency to infer correctness signals and train without any ground-truth supervision. We apply the algorithm to challenging mathematical reasoning tasks and show that it quickly reaches performance levels rivaling reinforcement-learning methods trained explicitly on gold-standard answers. Additionally, we analyze inherent limitations of the algorithm, highlighting how the self-generated proxy reward, though initially correlated with correctness, can incentivize reward hacking, where confidently incorrect outputs are favored. Our results illustrate how self-supervised improvement can achieve significant performance gains without external labels, while also revealing its fundamental challenges.
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on human supervision in LLM training
Self-training that uses the model's self-consistency as a correctness signal
Reward hacking induced by self-generated proxy rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online self-training reinforcement learning algorithm
Leverages the model's self-consistency to infer correctness
Requires no ground-truth supervision
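
The core idea above, using self-consistency as a proxy reward, can be sketched as follows. This is a minimal illustration, not the paper's implementation: for each problem, several answers are sampled from the model, and each sample is rewarded by whether it agrees with the majority answer of the group. The function name and the binary 0/1 reward scheme are assumptions for the sake of the example.

```python
from collections import Counter

def self_consistency_rewards(sampled_answers):
    """Proxy reward via self-consistency (illustrative sketch).

    Given the final answers of several samples for one problem,
    reward each sample 1.0 if it matches the majority-vote answer,
    else 0.0. No ground-truth label is ever consulted.
    """
    if not sampled_answers:
        return []
    # Majority vote over the sampled final answers.
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in sampled_answers]

# Five hypothetical final answers sampled for one math problem:
rewards = self_consistency_rewards(["42", "42", "41", "42", "40"])
# Samples agreeing with the majority ("42") receive reward 1.0.
```

These rewards would then feed a standard policy-optimization step in place of verifier-based rewards. The abstract's reward-hacking caveat is visible here: if the model becomes confidently wrong, the majority answer (and thus the reward) rewards that wrong consensus.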