Is Code Better Than Language for Algorithmic Reasoning?

📅 2026-06-14
🤖 AI Summary
This work investigates why code-based pipelines outperform natural-language reasoning on algorithmic tasks, separating the contribution of the intermediate representation from that of the external execution mechanism in tool-augmented reasoning. The authors introduce an intermediate intervention in which the model expresses its reasoning as executable code and the language model then simulates that code in context, so the representation changes while execution stays with the language model. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by 31.6 percentage points, whereas changing only the intermediate representation yields a negligible difference (+0.15 pp). These results suggest that, in the evaluated setting, reliable external execution, rather than the representational format, is the main driver of the gain.
📝 Abstract
For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by +31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, providing evidence that the performance gains require reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at https://github.com/TerryTong-Git/ToolProj.
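The three conditions compared in the abstract can be sketched roughly as follows. This is an illustrative outline only and is not taken from the authors' repository: the function names (`nl_arm`, `code_sim_arm`, `code_exec_arm`), the prompt wording, the convention that generated code stores its result in a variable called `answer`, and the stand-in `toy_llm` are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]

NL_PROMPT = "Reason step by step in natural language, then give only the final answer.\nProblem: "
CODE_PROMPT = "Write Python code that stores the result in a variable named `answer`.\nProblem: "
SIM_PROMPT = "Simulate this code step by step and give only the final value of `answer`.\nCode:\n"


def nl_arm(llm: LLM, problem: str) -> str:
    """Natural-language representation, language-model execution."""
    return llm(NL_PROMPT + problem).strip()


def code_sim_arm(llm: LLM, problem: str) -> str:
    """Code representation, language-model execution (the intermediate intervention)."""
    code = llm(CODE_PROMPT + problem)
    return llm(SIM_PROMPT + code).strip()


def code_exec_arm(llm: LLM, problem: str) -> str:
    """Code representation, deterministic execution by the Python interpreter."""
    code = llm(CODE_PROMPT + problem)
    scope: dict = {}
    # Assumption for the sketch: generated code is trusted. A real pipeline
    # would run it in a sandbox rather than calling exec() directly.
    exec(code, scope)
    return str(scope["answer"])


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline; handles one toy problem."""
    if prompt.startswith("Write Python"):
        return "answer = sum(range(1, 101))"
    return "5050"


if __name__ == "__main__":
    problem = "Sum the integers from 1 to 100."
    for arm in (nl_arm, code_sim_arm, code_exec_arm):
        print(arm.__name__, arm(toy_llm, problem))
```

Only `code_exec_arm` hands the computation to an interpreter. `nl_arm` and `code_sim_arm` differ solely in the representation the model reasons over, which is the contrast the intermediate intervention is meant to isolate.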
Problem

Research questions and friction points this paper is trying to address.

algorithmic reasoning
tool-augmented language models
intermediate representation
code execution
natural-language reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented reasoning
code execution
intermediate representation
disentangled evaluation
algorithmic reasoning