Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Enhancing large language models’ capability in multi-step complex mathematical reasoning remains challenging, particularly regarding data selection and construction for continual pretraining (CPT). Method: We systematically evaluate CPT data sources, demonstrating that problem-solving–oriented mathematical data significantly outperforms general mathematical corpora. We propose a “tutorship amplification” data synthesis method integrating expert-guided prompting, back-translation, and self-distillation, which achieves the best performance among the synthesis strategies evaluated. Furthermore, we empirically establish CPT’s superiority over supervised fine-tuning (SFT) in modeling multi-step reasoning structures. Contribution/Results: The approach culminates in JiuZhang-8B, a mathematical base model that delivers substantial performance gains across multiple mathematical reasoning benchmarks and ranks among the top-performing open-source mathematical base models to date. This work provides a reproducible methodology and empirical benchmark for optimizing mathematical reasoning in large language models, advancing both practical development and systematic evaluation.

📝 Abstract
Advancements in LLMs have significantly expanded their capabilities across various domains. However, mathematical reasoning remains a challenging area, prompting the development of math-specific LLMs. These models typically follow a two-stage training paradigm: pre-training with math-related corpora and post-training with problem datasets for SFT. Despite these efforts, the improvements in mathematical reasoning achieved through continued pre-training (CPT) are often less significant compared to those obtained via SFT. This study addresses this discrepancy by exploring alternative strategies during the pre-training phase, focusing on the use of problem-solving data over general mathematical corpora. We investigate three primary research questions: (1) Can problem-solving data enhance the model's mathematical reasoning capabilities more effectively than general mathematical corpora during CPT? (2) Are synthetic data from the same source equally effective, and which synthesis methods are most efficient? (3) How do the capabilities developed from the same problem-solving data differ between the CPT and SFT stages, and what factors contribute to these differences? Our findings indicate that problem-solving data significantly enhances the model's mathematical capabilities compared to general mathematical corpora. We also identify effective data synthesis methods, demonstrating that the tutorship amplification synthesis method achieves the best performance. Furthermore, while SFT facilitates instruction-following abilities, it underperforms compared to CPT with the same data, which can be partially attributed to its poor learning capacity for hard multi-step problem-solving data. These insights provide valuable guidance for optimizing the mathematical reasoning capabilities of LLMs, culminating in our development of a powerful mathematical base model called JiuZhang-8B.
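The abstract's third research question hinges on how the same problem-solving data is consumed differently at the CPT and SFT stages. One common, concrete difference (a general illustration of standard practice, not the paper's implementation; function names and token ids are made up) is how the training labels are built: CPT applies the next-token loss over the full sequence, while SFT masks prompt tokens so only the solution contributes to the loss.

```python
# Minimal sketch of CPT vs. SFT label construction for one
# problem-solving example. Token ids are toy values; the helper
# names are illustrative, not from the paper.

IGNORE_INDEX = -100  # conventional label value excluded from the loss


def cpt_labels(prompt_ids, solution_ids):
    """Continual pretraining: next-token loss over the whole sequence,
    so the model learns from both the problem and the solution tokens."""
    return prompt_ids + solution_ids


def sft_labels(prompt_ids, solution_ids):
    """Supervised fine-tuning: prompt tokens are masked out, so only
    the solution tokens contribute to the loss."""
    return [IGNORE_INDEX] * len(prompt_ids) + solution_ids


prompt = [11, 12, 13]   # e.g. "Solve: 2x + 3 = 7" (toy ids)
solution = [21, 22]     # e.g. "x = 2" (toy ids)

print(cpt_labels(prompt, solution))  # [11, 12, 13, 21, 22]
print(sft_labels(prompt, solution))  # [-100, -100, -100, 21, 22]
```

Under this view, the paper's finding that SFT "underperforms compared to CPT with the same data" on hard multi-step problems is plausible: masking the prompt discards the supervision signal carried by the problem statement itself.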
Problem

Research questions and friction points this paper is trying to address.

Mathematical Reasoning
Language Models
Complex Problem Solving

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mathematical Problem Solving
Data Training
Model Enhancement