🤖 AI Summary
Large language models (LLMs) often rely on costly human annotations or external reward models for reasoning alignment, hindering scalability and practical deployment.
Method: This paper proposes a self-supervised reinforcement learning framework that uses the model's intrinsic token-level confidence as an annotation-free, reward-model-free RL signal, eliminating the need for preference data or human feedback. It combines policy-gradient training on this self-confidence reward with low-rank adaptation (LoRA) fine-tuning.
Contribution/Results: Applied to Qwen2.5-Math-7B, the method achieves reasoning alignment without labels using only eight sampled completions per question and four training epochs, yielding accuracy improvements of +20.10%, +49.40%, and +52.50% on AIME2024, MATH500, and AMC23, respectively. It substantially reduces alignment cost while improving generalization across challenging mathematical reasoning benchmarks.
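The core idea above, using the model's own confidence as the policy-gradient reward, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes confidence is approximated by the mean token log-probability of a sampled answer (the paper's exact confidence definition may differ), and the helper names `sequence_confidence` and `reinforce_loss` are hypothetical.

```python
import math

def sequence_confidence(token_logprobs):
    """Proxy for self-confidence: mean token log-probability of the
    sampled sequence, mapped into (0, 1] via exp. (Assumed definition,
    not necessarily the one used in RLSC.)"""
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg)

def reinforce_loss(token_logprobs, baseline=0.0):
    """REINFORCE-style objective: scale the sequence log-probability by
    the confidence reward (treated as a constant, i.e. detached) minus a
    baseline. Minimizing this pushes the policy toward answers it is
    already confident in -- no labels or external reward model needed."""
    reward = sequence_confidence(token_logprobs)
    logp = sum(token_logprobs)
    return -(reward - baseline) * logp

# Toy example: a 3-token sampled answer with per-token log-probs.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.7)]
conf = sequence_confidence(logprobs)
loss = reinforce_loss(logprobs)
```

In a real training loop, `token_logprobs` would come from the fine-tuned model's forward pass over its own samples, with the reward term excluded from the gradient.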
📝 Abstract
Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 8 samples per question and 4 training epochs, RLSC improves accuracy by +20.10% on AIME2024, +49.40% on MATH500, and +52.50% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models with minimal supervision.