AI Summary
Existing incentive-training methods rely on external verifiers (e.g., mathematical or code executors) or expensive reward models, which limits them to verifiable domains and incurs high annotation costs. This paper proposes NOVER (NO-VERifier Reinforcement Learning), a reinforcement learning framework that optimizes a language model's reasoning-path generation end-to-end without external verifiers, using only standard supervised fine-tuning (SFT) data. Its core contributions are: (1) the first verifier-free incentive-training paradigm; (2) localized reward modeling over answer segments to guide fine-grained reasoning; and (3) support for new optimization modes such as inverse incentive training. Evaluated across diverse text-to-text tasks, NOVER outperforms same-size models distilled from large reasoning models by 7.7%, substantially enhancing reasoning capability while remaining general, low-cost, and scalable.
Abstract
Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely from the final-answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data and no external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms same-size models distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER opens new possibilities for optimizing large language models, such as inverse incentive training.
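To make the verifier-free idea concrete, here is a minimal sketch of one way such a reward could be computed: instead of an external verifier judging the answer, the policy model itself scores how likely the ground-truth answer becomes after the sampled reasoning trace, and that likelihood (inverse perplexity) serves as the reward. The function name and interface below are hypothetical illustrations, not the paper's actual implementation; in practice the log-probabilities would come from a language model scoring the reference answer conditioned on the generated reasoning.

```python
import math

def proxy_reward(answer_token_logprobs):
    """Hypothetical verifier-free proxy reward.

    `answer_token_logprobs` are the policy model's log-probabilities for
    each ground-truth answer token, scored after the sampled reasoning
    trace. Reasoning that makes the reference answer more likely yields
    lower perplexity and hence a higher reward; no external verifier or
    trained reward model is involved.
    """
    if not answer_token_logprobs:
        return 0.0
    avg_nll = -sum(answer_token_logprobs) / len(answer_token_logprobs)
    perplexity = math.exp(avg_nll)
    return 1.0 / perplexity  # in (0, 1]; 1.0 means the answer is certain

# A reasoning trace under which the reference answer is near-certain
# should earn a higher reward than one that leaves the answer unlikely.
good = [math.log(0.9)] * 4  # answer tokens highly probable given reasoning
bad = [math.log(0.2)] * 4   # answer tokens unlikely given reasoning
assert proxy_reward(good) > proxy_reward(bad)
```

Such a self-scored reward applies to any text-to-text task with SFT-style (input, reference answer) pairs, which is what lets the approach escape the math-and-code domains where symbolic verifiers exist.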