🤖 AI Summary
Math large language models (LLMs) lack autonomous strategy selection: the ability to dynamically choose between chain-of-thought reasoning and code execution during problem solving.
Method: We propose the first instruction-free autonomous code integration mechanism, built upon an expectation-maximization (EM) framework that unifies self-exploration and self-optimization. Our approach integrates dynamic data synthesis with off-policy reinforcement learning to enable end-to-end, iterative modeling and refinement of solution strategies.
Contribution/Results: Crucially, we internalize strategy selection as an intrinsic model capability, eliminating reliance on external human-defined scheduling heuristics. On the MATH benchmark, our method achieves 65.28% accuracy, an improvement of nearly 20 percentage points, while reducing code invocation frequency by up to 65%. This demonstrates substantial gains in both reasoning efficiency and robustness.
📝 Abstract
Recent research on tool integration for math Large Language Models (LLMs) aims to combine complementary strengths of chain-of-thought (CoT) reasoning and code execution. However, we discover a critical limitation: current tool-integrated math LLMs rely on externally dictated instructions to decide whether to use CoT or code, lacking the autonomy to choose the most appropriate method independently. This prompts us to study *Autonomous Code Integration* for math LLMs, which enables models to *independently* develop their own methodology-selection strategy in the absence of reliable supervision. To address this challenge, we propose an innovative Expectation-Maximization (EM) formulation that refines the model's decision-making through the exploration of its capabilities. This framework alternates between (a) computing a reference strategy that improves the model's belief over its capabilities through self-exploration, and (b) updating the model based on the refined belief. We further enhance this framework with an efficient implementation, incorporating a novel data synthesis strategy and off-policy reinforcement learning. Extensive experiments demonstrate that our approach, using only a public query set, significantly boosts the performance of existing math LLMs, raising accuracy by nearly 20% to 65.28% on the challenging MATH benchmark, while reducing code executions by up to 65%.
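The EM alternation described in the abstract can be sketched as a toy loop. Everything below is illustrative and hypothetical (`solve`, `em_step`, the success-reweighting rule are our own placeholder names, not the paper's implementation, which trains an actual LLM via data synthesis and off-policy reinforcement learning): the E-step explores the model's own capabilities by sampling rollouts per method, and the M-step re-estimates the method-selection strategy from that refined belief.

```python
import random

random.seed(0)

METHODS = ["cot", "code"]  # chain-of-thought vs. code execution

def solve(query, method):
    """Placeholder rollout: returns True if the sampled solution is correct.
    A real system would run the LLM with CoT or a code interpreter here."""
    return random.random() < (0.6 if method == "cot" else 0.5)

def em_step(policy, queries, n_samples=8, temperature=1.0):
    """One EM iteration: self-explore (E-step), then update the
    method-selection strategy from the refined belief (M-step)."""
    new_policy = {}
    for q in queries:
        # E-step: estimate each method's success rate by sampling rollouts.
        success = {m: sum(solve(q, m) for _ in range(n_samples)) / n_samples
                   for m in METHODS}
        # Reference strategy: prior belief reweighted by observed success
        # (small epsilon keeps unexplored methods from collapsing to zero).
        weights = {m: policy.get((q, m), 0.5) *
                      ((success[m] + 1e-3) ** (1.0 / temperature))
                   for m in METHODS}
        z = sum(weights.values())
        # M-step: normalized weights become the updated selection strategy.
        for m in METHODS:
            new_policy[(q, m)] = weights[m] / z
    return new_policy

policy = em_step({}, queries=["q1", "q2"])
```

In this toy version the "model update" is a direct tabular re-normalization; the paper instead performs the update end-to-end on the LLM's parameters, which is what makes strategy selection an intrinsic capability rather than an external heuristic.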