Process Supervision-Guided Policy Optimization for Code Generation

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) approaches to code generation that use unit-test feedback rely solely on sparse terminal rewards: when generated code fails every test, the policy receives no learning signal at all, which hinders incremental optimization on complex, long-horizon tasks. To address this, we propose a line-level Process Reward Model (PRM), the first to leverage dense, per-line correctness predictions both for reward shaping and for value function initialization. The PRM enables fine-grained supervision during code generation and real-time policy correction. By jointly optimizing the policy and value function under this dense reward signal, the approach achieves significant improvements over state-of-the-art methods on standard benchmarks including HumanEval and MBPP. Crucially, it maintains stable convergence even when all unit tests fail, overcoming the fundamental limitation of terminal-only feedback. This marks a key advance in enabling RL-based code generation to handle realistic, challenging programming tasks that require iterative refinement.
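The core mechanism can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact formulation: `prm_score`, the weight `beta`, and the additive combination with the terminal unit-test reward are all assumptions made for the sketch.

```python
# Sketch: dense reward shaping with a line-level Process Reward Model (PRM).
# Instead of a single terminal reward after running unit tests, each generated
# line receives a reward proportional to the PRM's correctness estimate.
from typing import Callable, List

def shaped_rewards(
    code_lines: List[str],
    passed_tests: bool,
    prm_score: Callable[[List[str], int], float],  # estimate that line i is correct
    terminal_reward: float = 1.0,
    beta: float = 0.1,  # weight on the dense PRM signal (illustrative value)
) -> List[float]:
    """Return one reward per generated line instead of a single terminal reward."""
    rewards = []
    for i in range(len(code_lines)):
        # Dense signal: PRM's per-line correctness estimate given the prefix.
        rewards.append(beta * prm_score(code_lines, i))
    # Sparse signal from unit tests still arrives at the final step.
    rewards[-1] += terminal_reward if passed_tests else 0.0
    return rewards
```

Even when `passed_tests` is `False`, the PRM terms are nonzero, so the policy still receives a gradient signal, which is exactly the failure mode that terminal-only feedback cannot handle.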

๐Ÿ“ Abstract
Reinforcement learning (RL) with unit test feedback has enhanced large language models' (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our experimental results also highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for long-horizon scenarios.
Problem

Research questions and friction points this paper is trying to address.

Improving code generation efficiency
Enhancing learning with immediate feedback
Addressing sparse rewards in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model for feedback
Line-level code correctness guidance
RL-driven code generation enhancement
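The value-initialization idea above can also be sketched. The discounted-sum construction below is an illustrative assumption about how per-line PRM scores could seed initial value targets; the paper initializes a neural value function, whose details are not given here.

```python
# Sketch: warm-starting value estimates from per-line PRM scores.
# V(s_i) is initialized as the discounted sum of PRM scores from line i onward,
# so the critic starts with a meaningful estimate before any unit test runs.
from typing import List

def init_values_from_prm(prm_scores: List[float], gamma: float = 0.99) -> List[float]:
    """Compute initial value targets as discounted cumulative PRM scores."""
    values = [0.0] * len(prm_scores)
    running = 0.0
    for i in reversed(range(len(prm_scores))):
        running = prm_scores[i] + gamma * running
        values[i] = running
    return values
```

Starting the critic from these targets, rather than from scratch, is one plausible reading of why the paper finds value-function initialization from the PRM to help jointly with dense rewards.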