🤖 AI Summary
In long chain-of-thought (CoT) reasoning with large language models, negative samples, which are often discarded during training, contain valuable but underutilized self-reflection and error-correction signals.
Method: This paper proposes BCPG-NSA, a fine-grained offline reinforcement learning framework. It introduces: (1) step-level negative sample parsing with a consensus-driven correctness discrimination mechanism; (2) behavior-constrained policy gradient (BCPG) for stable gradient updates; and (3) negative sample augmentation (NSA), which precisely identifies and reuses correct reasoning steps embedded in erroneous responses.
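The BCPG and NSA ideas can be illustrated with a minimal per-token objective. This is a hedged sketch, not the paper's actual formulation: the function name, the `nsa_weight` softening rule, and the KL-style behavior constraint are illustrative assumptions.

```python
def bcpg_nsa_loss(logprobs, ref_logprobs, rewards, step_correct,
                  nsa_weight=0.5, kl_coef=0.1):
    """Illustrative sketch of the BCPG-NSA idea (assumed, not the paper's exact loss).

    logprobs / ref_logprobs: per-token log-probs under the current / behavior policy.
    rewards: per-token reward, broadcast from the sample-level outcome
             (+1 for correct responses, -1 for incorrect ones).
    step_correct: per-token flag, True if the token belongs to a reasoning step
                  judged correct by the consensus mechanism.
    """
    loss, n = 0.0, 0
    for lp, ref_lp, r, ok in zip(logprobs, ref_logprobs, rewards, step_correct):
        adv = r
        # NSA: a correct step inside a negative sample gets a softened penalty
        # instead of the full negative reward (weighting is an assumption).
        if r < 0 and ok:
            adv = r * (1 - nsa_weight)
        pg = -adv * lp                 # vanilla policy-gradient term
        kl = kl_coef * (lp - ref_lp)   # behavior constraint: stay near the
                                       # behavior (data-generating) policy
        loss += pg + kl
        n += 1
    return loss / n
```

Under this sketch, a token in a correct step of a wrong response contributes a smaller penalty than a token in an incorrect step, which is the sense in which NSA "mines" positive steps from negative samples.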
Results: Experiments demonstrate that BCPG-NSA significantly outperforms rejection sampling fine-tuning (RFT) and RL baselines on mathematical and code reasoning benchmarks. It improves sample efficiency under identical data budgets and remains robust and scalable across multiple iterative training rounds.
📝 Abstract
Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT patterns. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet most existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that comprises three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
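Stage 2 (consensus-based step correctness assessment) can be sketched as follows. This is a minimal illustration under assumptions: the abstract only says an LLM judger and a PRM judger are combined by consensus, so the boolean-LLM-verdict interface, the PRM score threshold, and the "both must approve" rule are all hypothetical choices, not the paper's specification.

```python
def consensus_step_labels(steps, llm_judge, prm_judge, prm_threshold=0.5):
    """Label each reasoning step of a (negative) response as correct/incorrect.

    llm_judge(step) -> bool : LLM judger's verdict on the step (assumed interface).
    prm_judge(step) -> float: process reward model score in [0, 1] (assumed).
    A step is kept as "correct" only when both judgers agree (consensus rule
    is an illustrative assumption).
    """
    labels = []
    for step in steps:
        llm_ok = llm_judge(step)
        prm_ok = prm_judge(step) >= prm_threshold
        labels.append(llm_ok and prm_ok)  # consensus: both judgers approve
    return labels
```

The resulting per-step labels then feed stage 3, where tokens in steps labeled correct are penalized less during policy optimization.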