🤖 AI Summary
Supervisory signals for large language model (LLM) reasoning, particularly chain-of-thought (CoT) annotations, are hard to obtain reliably: human annotation is costly, while self-generated reasoning traces are error-prone and unverifiable.
Method: This paper proposes Code2CoT, a new paradigm that automatically constructs verifiable, stepwise CoT supervision data from program execution traces. It uses symbolic execution to extract deterministic execution paths, then combines structured code-to-natural-language translation with CoT distillation, enabling fully automated, high-fidelity, and scalable supervision generation without human annotation.
Contribution/Results: Code2CoT significantly improves LLM generalization across mathematical reasoning, symbolic deduction, and multi-hop question answering benchmarks, while cutting inference token consumption by mitigating redundant reasoning and repetitive generation. Its core innovation is grounding the construction of CoT supervision in the intrinsic determinism and verifiability of program execution, a first in the field.
📝 Abstract
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, ablation studies confirm that our method produces highly accurate reasoning data and reduces overall token length during inference by curbing meaningless repetition and overthinking.
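The core idea, turning a deterministic execution trace into step-by-step natural-language supervision, can be sketched in miniature. This is an illustrative toy, not the paper's actual pipeline: the paper uses symbolic execution, whereas the sketch below simply records concrete execution states with Python's standard `sys.settrace` hook and renders each state as an English sentence; the function names (`trace_program`, `steps_to_cot`, `sum_to_n`) are invented for this example.

```python
import sys

def trace_program(func, *args):
    """Record a (line number, local-variable snapshot) for each executed
    line of func. Because execution is deterministic, this sequence is a
    verifiable ground-truth record of the computation."""
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, steps

def steps_to_cot(steps):
    """Render each recorded execution state as a natural-language
    reasoning step (a stand-in for the paper's structured
    code-to-natural-language translation)."""
    cot = []
    for i, (_, local_vars) in enumerate(steps, 1):
        state = ", ".join(f"{k} = {v!r}" for k, v in local_vars.items())
        cot.append(f"Step {i}: the program state is now {state}.")
    return cot

def sum_to_n(n):
    # Toy target program: sum the integers 1..n.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

result, steps = trace_program(sum_to_n, 3)
cot = steps_to_cot(steps)
print(f"Answer: {result}")  # Answer: 6
for line in cot:
    print(line)
```

Each emitted step is checkable against the recorded program state, which is the property the paper exploits: unlike LLM-generated CoT, supervision derived this way is correct by construction.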