🤖 AI Summary
Problem: In single-agent mathematical reasoning systems, tight coupling between reasoning and code generation imposes excessive cognitive load on the model, hindering robust inference. Method: This paper proposes what it presents as the first decoupled two-agent collaborative framework, in which a Reasoning Agent handles stepwise reasoning and a Code Agent handles code generation and execution. This separation enables problem decomposition, isolated code execution, and reward shaping tailored to each agent’s role. Both agents are trained with a combination of imitation learning and advantage-based reinforcement learning. Results: Compared to single-agent baselines, the approach is reported to significantly increase the proportion of correct reasoning trajectories, improve accuracy on multi-step mathematical problems, and enhance training stability. Contribution: A decoupled, collaborative two-agent architecture with per-agent credit assignment, positioned as a more robust paradigm for complex reasoning tasks.
📝 Abstract
Current tool-integrated mathematical reasoning systems often adopt a single-agent paradigm, where one large language model handles problem reasoning, code generation, and code execution in an integrated workflow. While this design eases coordination, we hypothesize that it imposes cognitive load interference, as the agent must interleave long-horizon reasoning with precise program synthesis. We validate this hypothesis through a controlled comparison between a reasoning-only agent and a reasoning-plus-code agent, finding that the latter produces significantly fewer correct reasoning paths despite having tool-calling capabilities. To address this, we propose a dual-agent hybrid framework: a Reasoning Agent performs stepwise problem decomposition, and a Code Agent handles code generation and execution. Training combines imitation learning and reinforcement learning: the Code Agent receives strong rewards for matching intermediate ground-truth programs and weaker rewards for valid execution, while the Reasoning Agent is optimized chiefly via final-answer accuracy using advantage estimation to credit intermediate steps. This decoupled role design reduces cognitive interference and promotes stable reasoning-coding coordination.
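The reward scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the reward magnitudes `R_MATCH` and `R_EXEC`, the `CodeStep` container, the choice to judge a "match" by comparing execution output against a reference output, and the use of generalized advantage estimation (GAE) for the Reasoning Agent are all assumptions made for illustration. The abstract specifies only that matching rewards are stronger than execution rewards and that some form of advantage estimation credits intermediate steps.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical magnitudes: the abstract only says "strong" vs. "weaker".
R_MATCH = 1.0   # code output agrees with the intermediate reference
R_EXEC = 0.2    # code merely executes without error


@dataclass
class CodeStep:
    """One Code Agent call (illustrative container, not from the paper)."""
    program_output: Optional[str]    # None if execution raised an error
    reference_output: Optional[str]  # output of the ground-truth program, if any


def code_agent_reward(step: CodeStep) -> float:
    """Strong reward for matching the reference, weak reward for valid execution."""
    if step.program_output is None:
        return 0.0
    if (step.reference_output is not None
            and step.program_output.strip() == step.reference_output.strip()):
        return R_MATCH
    return R_EXEC


def reasoning_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """Sparse reward for the Reasoning Agent: only the final answer is scored."""
    rewards = [0.0] * num_steps
    if num_steps > 0 and final_correct:
        rewards[-1] = 1.0
    return rewards


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one trajectory.

    `values[t]` is a critic's estimate of the return from step t
    (the critic itself is assumed to exist and is not shown here).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value after the terminal step
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(code_agent_reward(CodeStep("4\n", "4")))
    rewards = reasoning_rewards(3, final_correct=True)
    print(gae_advantages(rewards, [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
```

With `gamma = lam = 1`, each step's advantage reduces to the final-answer reward minus the critic's value at that step, which is one simple way a single terminal reward can be propagated back to intermediate reasoning steps.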