Self Rewarding Self Improving

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle to perform reinforcement learning (RL) in the absence of reference answers or human-annotated rewards. Method: We propose a self-judgment mechanism that requires neither manual annotation nor programmatically designed rewards. Leveraging the intrinsic asymmetry between generated solutions and formal verification, our approach constructs reliable reward signals, establishing a closed-loop self-improvement paradigm: “synthetic problem generation → solution generation → self-evaluation → optimization.” It integrates Proximal Policy Optimization (PPO) with mathematical reasoning fine-tuning. Contribution/Results: To our knowledge, this is the first work achieving pure self-supervised LLM self-improvement. On Qwen2.5-7B, it improves mathematical reasoning accuracy by 8% and surpasses GPT-4o on integral calculus tasks. Reward generalization is validated in fully unsupervised settings—including Countdown and the MIT Integration Bee—demonstrating robustness and advancing LLMs toward autonomous, continual learning paradigms.

📝 Abstract
We demonstrate that large language models can effectively self-improve through self-judging without requiring reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments on Countdown puzzles and MIT Integration Bee problems show that models can provide reliable reward signals without ground-truth answers, enabling reinforcement learning in domains where it was previously infeasible. By implementing self-judging, we achieve significant performance gains while maintaining alignment with formal verification. When combined with synthetic question generation, we establish a complete self-improvement loop in which models generate practice problems, solve them, and evaluate their own performance, achieving an 8% improvement with Qwen 2.5 7B over baseline and surpassing GPT-4o performance on integration tasks. Our findings demonstrate that LLM judges can provide effective reward signals for training models, unlocking many reinforcement learning environments previously limited by the difficulty of creating programmatic rewards. This suggests a potential paradigm shift toward AI systems that continuously improve through self-directed learning rather than human-guided training, potentially accelerating progress in domains with scarce training data or complex evaluation requirements.
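The loop described above (synthetic problem generation → solution generation → self-evaluation → optimization) can be illustrated with a minimal toy sketch. This is not the paper's implementation: the policy is replaced by random expression sampling over a Countdown-style task, and the judge is a programmatic evaluator standing in for the LLM judge, purely to show how the verification asymmetry (checking a candidate is much cheaper than producing one) yields a reward signal. All function names here are illustrative.

```python
import random

def generate_problem(rng):
    # Synthetic problem generation: a toy Countdown-style task.
    # The target is constructed so at least one expression solves it.
    nums = [rng.randint(1, 9) for _ in range(3)]
    target = nums[0] + nums[1] * nums[2]
    return nums, target

def propose_solution(nums, rng):
    # Stand-in for the policy: sample a random expression over the numbers.
    a, b, c = nums
    ops = [rng.choice("+-*") for _ in range(2)]
    return f"{a} {ops[0]} {b} {ops[1]} {c}"

def judge(expr, target):
    # Self-judgment via the generation/verification asymmetry:
    # verifying a candidate reduces to evaluating it against the target.
    try:
        return 1.0 if eval(expr) == target else 0.0
    except Exception:
        return 0.0

def self_improvement_step(rng, samples=16):
    # One iteration of the loop: generate a problem, roll out candidate
    # solutions, score them with the judge. A PPO-style update would
    # consume the (solution, reward) pairs; here we report mean reward.
    nums, target = generate_problem(rng)
    rollouts = [propose_solution(nums, rng) for _ in range(samples)]
    rewards = [judge(e, target) for e in rollouts]
    return sum(rewards) / samples

mean_reward = self_improvement_step(random.Random(0))
print(f"mean batch reward: {mean_reward:.3f}")
```

In the paper's setting the judge is itself an LLM (e.g. checking an integral by differentiating the candidate answer), so the same loop applies even where no programmatic verifier exists.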
Problem

Research questions and friction points this paper is trying to address.

LLMs self-improve via self-judging without reference solutions
Models provide reliable rewards without ground truth answers
Self-judging enables reinforcement learning in new domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-judging enables reliable reward signals
Synthetic question generation completes self-improvement loop
LLM judges unlock reinforcement learning environments