PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two problems: current large language models are limited in mathematical reasoning by weak numerical computation and poor handling of abstract numerical questions, and few fine-grained benchmarks integrate numerical processing with mathematical reasoning. To bridge this gap, the authors introduce PyraMathBench, a hierarchical evaluation benchmark comprising 32,505 problems across four cognitive dimensions, 14 subcategories, and two modalities. They further propose the SOLVE module, which makes tool calls more efficient through fuzzy matching and rejection of low-quality calls, and an Interactive Relative Policy Optimization (IRPO) algorithm to strengthen the synergy between numerical and mathematical reasoning. In comparative experiments, Qwen-2.5 trained with SOLVE and IRPO improves its PyraMathBench score by 5.0 points.
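To make the hierarchical structure concrete, here is a minimal Python sketch of how per-question results could be rolled up from subcategories to cognitive dimensions to one overall score. The record layout, the function name `hierarchical_accuracy`, and the choice of macro-averaging at each level are all assumptions for illustration; the summary does not say how PyraMathBench aggregates its scores.

```python
from collections import defaultdict


def hierarchical_accuracy(records):
    """Roll per-question results up a two-level hierarchy.

    `records` is an iterable of (dimension, subcategory, is_correct) tuples.
    This layout is hypothetical and is not the benchmark's actual format.
    """
    # Level 1: accuracy within each (dimension, subcategory) pair.
    by_sub = defaultdict(list)
    for dimension, subcategory, is_correct in records:
        by_sub[(dimension, subcategory)].append(bool(is_correct))
    sub_acc = {key: sum(vals) / len(vals) for key, vals in by_sub.items()}

    # Level 2: macro-average the subcategory accuracies per dimension
    # (an assumed aggregation rule, chosen so small subcategories count equally).
    by_dim = defaultdict(list)
    for (dimension, _), acc in sub_acc.items():
        by_dim[dimension].append(acc)
    dim_acc = {dim: sum(vals) / len(vals) for dim, vals in by_dim.items()}

    # Top level: macro-average across dimensions.
    overall = sum(dim_acc.values()) / len(dim_acc)
    return sub_acc, dim_acc, overall


if __name__ == "__main__":
    demo = [
        ("A", "a1", True),
        ("A", "a1", False),
        ("A", "a2", True),
        ("B", "b1", False),
    ]
    print(hierarchical_accuracy(demo))
```

Reporting all three levels, rather than only the overall number, is what lets a failure be traced to a specific subcategory.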

📝 Abstract

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.
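The abstract names two mechanisms for efficient tool calls: fuzzy matching and low-quality call rejection. The following Python sketch illustrates the general idea using the standard library's `difflib`; it is not the authors' SOLVE implementation. The tool names in `KNOWN_TOOLS`, the function `resolve_tool_call`, the similarity cutoff of 0.8, and the two-numeric-argument check are all hypothetical.

```python
import difflib

# Hypothetical tool registry; the abstract does not list SOLVE's actual tools.
KNOWN_TOOLS = ("add", "subtract", "multiply", "divide")


def resolve_tool_call(name, args, cutoff=0.8):
    """Map a possibly misspelled tool name to a known tool, or reject the call.

    Returns (tool_name, args_tuple) on success and None when the call is
    judged low quality. Both the cutoff and the argument check are assumptions.
    """
    # Fuzzy matching: pick the closest known tool name above the cutoff.
    matches = difflib.get_close_matches(
        name.strip().lower(), KNOWN_TOOLS, n=1, cutoff=cutoff
    )
    if not matches:
        return None  # No tool is similar enough, so reject the call.

    # Low-quality call rejection: require exactly two numeric arguments.
    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if len(args) != 2 or not all(is_number(a) for a in args):
        return None

    return matches[0], tuple(args)


if __name__ == "__main__":
    print(resolve_tool_call("mutliply", [3, 4]))  # misspelled name is repaired
    print(resolve_tool_call("xyz", [3, 4]))       # unknown tool is rejected
    print(resolve_tool_call("divide", ["a", 2]))  # malformed arguments are rejected
```

Repairing near-miss names avoids discarding calls that are otherwise valid, while the rejection step keeps malformed calls from reaching the tool at all.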

Problem

Research questions and friction points this paper is trying to address.

numerical reasoning

mathematical capability

large language models

benchmark

math word problems

Innovation

Methods, ideas, or system contributions that make the work stand out.

PyraMathBench

SOLVE

IRPO