AI Summary
Large language models (LLMs) exhibit limited logical reasoning and analytical capabilities in code generation tasks. Method: This paper proposes CRPE, a three-stage framework built upon Qwen2.5-Coder, integrating chain-of-thought (CoT) modeling, StepDPO reinforcement learning, and multi-stage synthetic data distillation. It introduces a novel, code-reasoning-oriented autonomous data synthesis and closed-loop enhancement mechanism, enabling an end-to-end open-source pipeline, from instruction parsing and expert-level reasoning data construction to iterative model capability refinement. Contribution/Results: Experiments demonstrate that COT-Coder-7B-StepDPO achieves a pass@1 score of 21.88 on LiveCodeBench, outperforming same-scale models; COT-Coder-32B-StepDPO attains 35.08, surpassing GPT-4o. Both variants significantly advance code-level logical reasoning performance.
Abstract
We introduce CRPE (Code Reasoning Process Enhancer), a three-stage framework for data synthesis and model training that advances the development of sophisticated code reasoning capabilities in large language models (LLMs). Building upon existing system-1 models, CRPE addresses the fundamental challenge of enhancing LLMs' analytical and logical processing in code generation tasks. Our framework presents a methodologically rigorous yet implementable approach to cultivating advanced code reasoning abilities in language models. Using CRPE, we develop an enhanced COT-Coder that demonstrates marked improvements in code generation tasks. Evaluation results on LiveCodeBench (20240701-20240901) show that our COT-Coder-7B-StepDPO, derived from Qwen2.5-Coder-7B-Base, reaches a pass@1 accuracy of 21.88, exceeding all models of similar or even larger size. Furthermore, our COT-Coder-32B-StepDPO, based on Qwen2.5-Coder-32B-Base, reaches a pass@1 accuracy of 35.08, outperforming GPT-4o on the benchmark. Overall, CRPE is a comprehensive, open-source method that covers the complete pipeline, from instruction data acquisition through expert code reasoning data synthesis to an autonomous reasoning enhancement mechanism.
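The abstract reports results as pass@1 but does not say how the metric was computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled solutions that pass all tests, averaged over problems. The function names and the per-problem counts are hypothetical and are not taken from the paper or from LiveCodeBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that pass every test
    k: sample budget being evaluated (k = 1 for pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems, reported as a percentage."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical (n, c) counts for four problems; not data from the paper.
example_results = [(10, 10), (10, 5), (10, 0), (10, 2)]
print(f"pass@1 = {benchmark_pass_at_k(example_results, k=1):.2f}")
```

Under this reading, a score such as 21.88 would mean that about 22% of attempts pass all tests on average across problems, assuming the score is a percentage.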