Outcome-Refining Process Supervision for Code Generation

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Large language models (LLMs) often fall short on complex programming tasks that demand multi-step algorithmic reasoning: process supervision relies on costly and error-prone reward modeling, while outcome supervision struggles to guide intermediate reasoning steps. To address this, the paper proposes Outcome-Refining Process Supervision, a paradigm that treats outcome refinement itself as the process to be supervised. It eliminates explicit reward modeling and instead uses program execution feedback (runtime outputs and error traces) as label-free, reliable intermediate supervision signals, combined with tree-based multi-path exploration for efficient, execution-guided reasoning. Evaluated across five LLMs and three benchmark datasets, the method yields average improvements of 26.9% in code correctness and 42.2% in execution efficiency, and notably boosts smaller models on algorithmic competition-style tasks. This establishes a scalable, low-overhead paradigm for complex programming reasoning, grounded in direct execution feedback rather than surrogate reward signals.
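The label-free supervision signal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `execution_feedback` and its return fields are hypothetical stand-ins for whatever sandboxed runner the real system uses.

```python
import subprocess
import sys

def execution_feedback(code: str, test_input: str, expected: str) -> dict:
    """Run a candidate program and return label-free supervision signals:
    runtime output, error traces, and a concrete pass/fail outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input, capture_output=True, text=True, timeout=5,
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,  # error traces guide the next refinement
            "passed": proc.stdout.strip() == expected.strip(),
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timeout", "passed": False}

# Example: execution, not a learned reward model, judges the candidate
feedback = execution_feedback("print(int(input()) * 2)", "21", "42")
```

The key design point is that the signal is verifiable by construction: a program either reproduces the expected output or it does not, so no surrogate reward model has to be trained or trusted.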

📝 Abstract
Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, creating more reliable verification than traditional reward models without requiring trained process reward models (PRMs). Our approach achieves significant improvements across 5 models and 3 datasets: an average increase of 26.9% in correctness and 42.2% in efficiency. The results suggest that providing structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS
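The idea of "outcome refinement as the supervised process" can be sketched as a loop in which each round's execution result grounds the next generation step. This is a toy sketch, not the ORPS implementation; `generate` and `execute` are hypothetical stand-ins for an LLM call and a sandboxed runner.

```python
def refine_until_correct(generate, execute, max_rounds=4):
    """Treat outcome refinement itself as the supervised process:
    each round's concrete execution outcome grounds the next attempt."""
    feedback = None
    history = []
    for _ in range(max_rounds):
        code = generate(feedback)   # in the real system: an LLM call
        outcome = execute(code)     # concrete verification signal
        history.append((code, outcome))
        if outcome["passed"]:
            return code, history
        feedback = outcome["stderr"] or outcome["stdout"]
    return history[-1][0], history

# Toy demo: the "model" fixes its program once it sees an error trace
def toy_generate(feedback):
    return "fixed_code" if feedback else "buggy_code"

def toy_execute(code):
    ok = code == "fixed_code"
    return {"passed": ok, "stderr": "" if ok else "NameError: x is not defined",
            "stdout": ""}

best, history = refine_until_correct(toy_generate, toy_execute)
```

The history of (candidate, outcome) pairs is exactly the "process" being supervised, with execution rather than a learned PRM providing each step's judgment.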
Problem

Research questions and friction points this paper is trying to address.

Improving code generation for complex programming tasks
Unifying process and outcome supervision via execution
Overcoming local optima in LLM-generated code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies process and outcome supervision via executable verification
Tree-structured search with execution metrics and self-critique
Integrates runtime feedback to refine reasoning and code
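The tree-structured search in the bullets above, which keeps several solution trajectories alive instead of committing to one refinement path, can be sketched as a beam search scored by an execution metric. A minimal sketch under stated assumptions: the integer states and distance-based score are hypothetical stand-ins for candidate programs and their execution metrics.

```python
import heapq

def tree_search(root, expand, score, beam_width=3, depth=2):
    """Maintain multiple solution trajectories simultaneously: at each
    level, expand every kept candidate into refinements, score each with
    a concrete execution metric, and keep only the best beam."""
    beam = [root]
    for _ in range(depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        # keep the top trajectories instead of greedily following one path
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)

# Toy demo: integers stand in for candidate programs; the score stands in
# for an execution metric (higher is better, target value 10)
result = tree_search(0,
                     lambda n: [n + 1, n + 2, n + 3],
                     lambda n: -abs(n - 10))
```

Keeping a beam of trajectories is what lets the method escape the local optima named in the Problem section: a refinement path that scores poorly now is only discarded once better-scoring siblings exist.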