🤖 AI Summary
Large language models (LLMs) often rely on costly human annotations or external reward models for reasoning alignment, hindering scalability and practical deployment.
Method: This paper proposes a self-supervised reinforcement learning framework that uses the model's intrinsic token-level confidence as an annotation-free, reward-model-free RL signal, eliminating the need for preference data or human feedback. It combines policy-gradient training on this self-confidence reward with low-rank adaptation (LoRA) fine-tuning.
Contribution/Results: Applied to Qwen2.5-Math-7B, the method achieves reasoning alignment without labels using only eight sampled completions per question and four training epochs, yielding accuracy improvements of +20.10%, +49.40%, and +52.50% on AIME2024, MATH500, and AMC23, respectively. It substantially reduces alignment cost while improving generalization across challenging mathematical reasoning benchmarks.
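The core idea above, using the model's own confidence as the policy-gradient reward, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes confidence is approximated by the mean token log-probability of a sampled answer (the paper's exact confidence definition may differ), and the helper names `sequence_confidence` and `reinforce_loss` are hypothetical.

```python
import math

def sequence_confidence(token_logprobs):
    """Proxy for self-confidence: mean token log-probability of the
    sampled sequence, mapped into (0, 1] via exp. (Assumed definition,
    not necessarily the one used in RLSC.)"""
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg)

def reinforce_loss(token_logprobs, baseline=0.0):
    """REINFORCE-style objective: scale the sequence log-probability by
    the confidence reward (treated as a constant, i.e. detached) minus a
    baseline. Minimizing this pushes the policy toward answers it is
    already confident in -- no labels or external reward model needed."""
    reward = sequence_confidence(token_logprobs)
    logp = sum(token_logprobs)
    return -(reward - baseline) * logp

# Toy example: a 3-token sampled answer with per-token log-probs.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.7)]
conf = sequence_confidence(logprobs)
loss = reinforce_loss(logprobs)
```

In a real training loop, `token_logprobs` would come from the fine-tuned model's forward pass over its own samples, with the reward term excluded from the gradient.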
📝 Abstract
Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 8 samples per question and 4 training epochs, RLSC improves accuracy by +20.10% on AIME2024, +49.40% on MATH500, and +52.50% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models with minimal supervision.