🤖 AI Summary
Current large language models (LLMs) show limited ability to reason about code execution behavior, such as predicting a program's output or whether a given statement will be executed, and supervised fine-tuning alone generalizes poorly. To address this, CodeReasoner combines a dataset-construction method with a two-stage training process: (1) constructing high-quality training data that captures the core execution logic of Python programs, with reasoning chains distilled from a strong teacher model and injected via instruction tuning; and (2) a GRPO-based reinforcement learning stage that further strengthens reasoning and generalization on top of the fine-tuned model. The approach enables a 7B-parameter model to match GPT-4o on key code reasoning tasks, improving over prior methods by 27.1% to 40.2% across three standard benchmarks; the 14B variant surpasses GPT-4o on all benchmarks. The core contributions are an execution-logic-aware data construction paradigm and the two-stage instruction-tuning-plus-GRPO training recipe, which together improve the generalization of compact LLMs for code reasoning.
📝 Abstract
Code reasoning is a fundamental capability for large language models (LLMs) in the code domain. It involves understanding and predicting a program's execution behavior, such as determining the output for a given input or whether a specific statement will be executed. This capability is essential for downstream tasks like debugging, code generation, and program repair. Prior approaches mainly rely on supervised fine-tuning to improve performance in code reasoning tasks. However, they often show limited gains and fail to generalize across diverse scenarios. We argue this is due to two core issues: the low quality of training data and the limitations of supervised fine-tuning, which struggles to teach general reasoning skills. To address these challenges, we propose CodeReasoner, a framework that spans both dataset construction and a two-stage training process. First, we introduce a method to construct datasets that focus on the core execution logic of Python programs. Next, we apply instruction tuning to inject execution-specific knowledge distilled from a powerful teacher model. We then enhance reasoning and generalization through GRPO reinforcement learning on top of the fine-tuned model. Experiments on three widely-used code reasoning benchmarks show that CodeReasoner improves performance by 27.1% to 40.2% over prior methods using a 7B model. Notably, the 7B model matches GPT-4o on key tasks like input/output and coverage prediction. When scaled to 14B, CodeReasoner outperforms GPT-4o across all benchmarks. Ablation studies confirm the effectiveness of each training stage and highlight the importance of reasoning chains.
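The abstract describes constructing training data from the execution behavior of Python programs (e.g., which statements run and what output is produced), without giving implementation details. As a minimal sketch, not the paper's actual pipeline, one way to collect such ground-truth execution traces is Python's built-in `sys.settrace` hook; the `record_trace` helper and `sample` function below are illustrative names:

```python
import sys

def record_trace(func, *args):
    """Run func(*args), recording the offset (relative to the def line)
    of each executed line, plus the return value."""
    trace = []

    def tracer(frame, event, arg):
        # 'line' events fire before each statement in the traced frame.
        if event == "line" and frame.f_code is func.__code__:
            trace.append(frame.f_lineno - func.__code__.co_firstlineno)
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return trace, result

def sample(x):
    if x > 0:
        return x * 2
    return -x

# Which branch ran, and what was returned, become supervision signals
# for output prediction and statement-reachability tasks.
print(record_trace(sample, 3))   # → ([1, 2], 6): the if and the first return
print(record_trace(sample, -5))  # → ([1, 3], 5): the if and the fallback return
```

Pairing such traces with the source code yields (program, input, execution behavior) triples, from which reasoning-chain annotations can then be generated.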
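The second training stage uses GRPO (Group Relative Policy Optimization). Its defining idea is to score each sampled response relative to the other responses in its group, replacing a learned value function with a group-normalized advantage. The paper's reward design is not specified here; the sketch below shows only the generic group-relative advantage computation, with illustrative reward values:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All responses scored equally: no learning signal for this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# E.g., four responses to one prompt, rewarded 1 for a correct
# output prediction and 0 otherwise:
print(grpo_advantages([1, 0, 1, 0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Correct responses get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the group's better samples without needing a critic model.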