Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a limitation of existing Negative Sample Reinforcement (NSR) methods: they apply a fixed penalty and do not differentiate the severity of reasoning errors, which constrains improvement in large language models' reasoning capabilities. To overcome this, the work proposes an adaptive NSR framework that uses a time-varying scheduling function to adjust error-correction intensity across training stages. It also introduces a confidence-weighted mechanism, based on normalized sequence likelihood, that applies differentiated penalties according to error severity and so balances exploration against correction. Implemented within the RLVR paradigm, the approach enables fine-grained token-level updates and is reported to significantly outperform PPO and GRPO on Qwen2.5-Math-1.5B, with state-of-the-art results on challenging mathematical reasoning benchmarks including MATH, AIME 2025, and AMC23, while improving both the accuracy and the diversity of reasoning outputs.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR) -- which focuses on penalizing incorrect steps rather than simply rewarding correct ones -- can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout training and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework. The first is Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions: in the initial training phases, the system focuses heavily on correcting errors to stabilize the model, and as training continues, it shifts toward more subtle and controlled updates. The second is Confidence-Weighted Negative Reinforcement (CW-NSR), which operates on the principle that different mistakes carry different levels of importance. CW-NSR assigns penalty weights based on the model's normalized sequence likelihood. If the model is highly confident in a wrong path, that response receives a larger penalty, whereas uncertain errors -- where the model is effectively exploring -- are penalized less strictly. Our formal analysis shows how these mechanisms govern token-level updates, allowing the model to leverage prior-guided probability redistribution while providing a natural defense against overfitting. We evaluate these methods on difficult reasoning datasets, including MATH, AIME 2025, and AMC23, using the Qwen2.5-Math-1.5B architecture.
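The two mechanisms described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the cosine shape of the schedule, the `lam_start` and `lam_end` values, the use of the geometric-mean token probability as the "normalized sequence likelihood", the product form of the final weight, and all function names. It should not be read as the authors' implementation.

```python
import math


def penalty_schedule(step, total_steps, lam_start=1.0, lam_end=0.1):
    """Time-dependent penalty scale (assumed cosine decay, not from the paper).

    Strong correction early in training, gentler updates later.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lam_end + 0.5 * (lam_start - lam_end) * (1.0 + math.cos(math.pi * progress))


def normalized_sequence_likelihood(token_logprobs):
    """Length-normalized likelihood: exp of the mean token log-probability.

    This is one common normalization; the paper may define it differently.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def negative_sample_weight(step, total_steps, token_logprobs):
    """Assumed combination: schedule value times model confidence."""
    confidence = normalized_sequence_likelihood(token_logprobs)
    return penalty_schedule(step, total_steps) * confidence


def weighted_nsr_loss(samples, step, total_steps):
    """Scalar loss over incorrect responses only.

    `samples` is a list of (token_logprobs, is_correct) pairs. Minimizing
    weight * mean(log p) lowers the likelihood of wrong responses. In an
    autodiff framework the weight would normally be treated as a constant
    (no gradient through it); that is also an assumption of this sketch.
    """
    total, count = 0.0, 0
    for token_logprobs, is_correct in samples:
        if is_correct:
            continue  # only negative samples are reinforced
        weight = negative_sample_weight(step, total_steps, token_logprobs)
        total += weight * sum(token_logprobs) / len(token_logprobs)
        count += 1
    return total / count if count else 0.0


# A confidently wrong response versus an exploratory wrong response.
confident_wrong = [-0.1, -0.1, -0.1]
uncertain_wrong = [-2.0, -2.0, -2.0]
w_confident = negative_sample_weight(0, 100, confident_wrong)
w_uncertain = negative_sample_weight(0, 100, uncertain_wrong)
```

Under these assumptions, `w_confident` is larger than `w_uncertain` at any fixed step, and both shrink as `step` approaches `total_steps`, which matches the qualitative behavior the abstract describes.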
Problem

Research questions and friction points this paper is trying to address.

Negative Sample Reinforcement
Reinforcement Learning
Large Language Models
Adaptive Penalty
Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Negative Sample Reinforcement
Confidence-Weighted Negative Reinforcement
Reinforcement Learning with Verifiable Rewards
LLM Reasoning
Dynamic Penalty Scheduling
🔎 Similar Papers