🤖 AI Summary
Large language models (LLMs) generate unreliable text and lack deterministic verification mechanisms for complex mathematical reasoning. To address this, we propose SymCode, a neuro-symbolic framework that reformulates mathematical reasoning as verifiable code generation, leveraging LLMs for high-level reasoning while delegating precise computation and validation to the SymPy symbolic engine. Its core innovation lies in replacing natural-language reasoning with programmatic outputs, thereby exposing errors explicitly and enabling automated correctness checking. Evaluated on MATH-500 and OlympiadBench, SymCode achieves accuracy gains of up to 13.6 percentage points over prompting baselines such as chain-of-thought, while reducing token consumption. This approach improves the accuracy, trustworthiness, and computational efficiency of formal mathematical reasoning.
📝 Abstract
Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neuro-symbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
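To make the core idea concrete, here is a minimal sketch (our illustration, not the paper's actual prompt or pipeline) of what "verifiable code generation" with SymPy looks like: instead of emitting a prose derivation, the model emits SymPy code whose result can be checked deterministically by substituting the answer back into the original equation.

```python
# Illustrative sketch of the SymCode-style output format (assumed, not
# taken from the paper): the LLM produces SymPy code rather than prose,
# so the symbolic engine performs exact computation and verification.
import sympy as sp

x = sp.symbols('x')

# Example problem: find the real solutions of x^2 - 5x + 6 = 0.
equation = sp.Eq(x**2 - 5*x + 6, 0)
solutions = sp.solve(equation, x)  # exact symbolic solving, no float error

# Deterministic verification: substitute each root back into the equation.
# A wrong "answer" would fail here as a transparent, programmatic error,
# rather than hiding inside a plausible-sounding prose argument.
for s in solutions:
    assert sp.simplify(equation.lhs.subs(x, s)) == 0

print(sorted(solutions))  # [2, 3]
```

The key property is that any arithmetic or logical slip surfaces as a failed assertion or a raised exception at execution time, which is what the abstract means by shifting failures from opaque fallacies to transparent errors.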