🤖 AI Summary
This work identifies two root causes of Proximal Policy Optimization (PPO) failure in long-chain-of-thought (Long-CoT) reasoning: (1) value function initialization bias and (2) severe reward-signal attenuation along the reasoning chain. To address these, the authors propose Value-Calibrated PPO (VC-PPO), featuring: (1) a pretrained value model that mitigates initialization bias, and (2) a decoupled Generalized Advantage Estimation (GAE) scheme in which the actor's advantages are computed separately from the critic's value targets, suppressing the propagation of value-estimation error along long sequences. VC-PPO thereby stabilizes RL training for long chain-of-thought generation. On the AIME mathematical reasoning benchmark, VC-PPO significantly improves PPO's success rate, and ablation studies confirm that both components are critical for training stability. The authors present this as the first work to systematically diagnose and resolve these fundamental failure mechanisms of PPO in Long-CoT settings.
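The decoupling idea can be illustrated with a minimal GAE sketch. This is an assumption-laden illustration, not the paper's implementation: the exact λ values (1.0 for the actor, 0.95 for the critic) and the toy trajectory are hypothetical, chosen only to show how a sparse terminal reward, typical of Long-CoT RL, propagates to early tokens undecayed when the actor's λ is set to 1, while the critic can still use a lower-variance λ for its value targets.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards, length T.
    values:  critic estimates, length T+1 (bootstrap value appended).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD residual at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of residuals
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy trajectory (hypothetical numbers): sparse reward only at the end,
# as in Long-CoT RL where correctness is judged after the full chain.
rewards = [0.0, 0.0, 0.0, 1.0]
values  = [0.2, 0.3, 0.4, 0.5, 0.0]  # critic estimates + terminal bootstrap

# Decoupled GAE: different lambdas for the two consumers of the estimate.
actor_adv  = gae(rewards, values, lam=1.0)   # λ_actor = 1: terminal reward reaches token 0 undecayed
critic_adv = gae(rewards, values, lam=0.95)  # λ_critic < 1: lower-variance targets for value learning
value_targets = critic_adv + np.array(values[:-1])

print(actor_adv[0])   # 0.8 = full return (1.0) minus V(s_0) (0.2)
print(critic_adv[0])  # smaller: distant reward is discounted by λ_critic
```

With a shared λ < 1, the terminal reward would be shrunk by λ^T before reaching early tokens, which is the reward-decay failure mode the summary describes; decoupling lets the policy gradient see the undecayed signal without forcing high-variance targets onto the critic.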
📝 Abstract
Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks like math and reasoning. However, Proximal Policy Optimization (PPO), effective in many RL scenarios, fails in long CoT tasks. This paper identifies value initialization bias and reward signal decay as the root causes of PPO's failure. We propose Value-Calibrated PPO (VC-PPO) to address these issues. In VC-PPO, the value model is pretrained to tackle initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance. Ablation studies show that the techniques in VC-PPO are essential for enhancing PPO on long CoT tasks.