🤖 AI Summary
Existing evaluations of low-bit quantization (e.g., INT4/INT2) on large language models (LLMs) lack fine-grained analysis of its impact on mathematical reasoning, in particular the distinction between numerical computation and reasoning planning capabilities.
Method: We propose the first multi-dimensional evaluation framework specifically for quantization effects on mathematical reasoning, decoupling these two core capabilities. Our approach integrates layer-wise sensitivity analysis, step-level reasoning trajectory comparison, and quantitative tracking across capability dimensions.
Contribution/Results: Experiments on benchmarks such as MATH reveal that reasoning planning degrades significantly (up to −38%), whereas numerical computation remains comparatively robust. Critical vulnerability points are identified in intermediate attention layers and MLP output representations. The framework provides an interpretable, capability-aware diagnostic tool for quantization-robustness optimization, enabling targeted mitigation strategies for mathematically demanding tasks.
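The layer-wise sensitivity analysis mentioned in the method can be sketched as follows: quantize one layer's weights at a time and measure how much the model's output deviates from the full-precision baseline. This is a minimal illustration, not the paper's actual procedure; `quantize_int4`, the toy MLP, and the relative-error metric are all hypothetical stand-ins.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor INT4 round trip (hypothetical helper).

    Weights are scaled into the signed 4-bit range, rounded, and
    dequantized, simulating the precision loss of INT4 storage.
    """
    scale = np.abs(w).max() / 7.0  # map max magnitude to 7 (INT4: [-8, 7])
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale

def forward(layers, x):
    """Toy stand-in for an LLM: a small MLP with ReLU activations."""
    h = x
    for w in layers:
        h = np.maximum(h @ w, 0)
    return h

def layerwise_sensitivity(layers, x):
    """Quantize one layer at a time; return relative output error per layer."""
    baseline = forward(layers, x)
    sens = []
    for i in range(len(layers)):
        perturbed = [quantize_int4(w) if j == i else w
                     for j, w in enumerate(layers)]
        out = forward(perturbed, x)
        sens.append(np.linalg.norm(out - baseline)
                    / (np.linalg.norm(baseline) + 1e-12))
    return sens

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) * 0.3 for _ in range(3)]
x = rng.standard_normal((4, 8))
print(layerwise_sensitivity(layers, x))
```

Layers whose quantization produces the largest relative error would be flagged as vulnerability points; the paper's analysis suggests intermediate attention layers and MLP outputs play this role in real models.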
📝 Abstract
Large language models have achieved significant advances on complex mathematical reasoning benchmarks such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy to reduce memory usage and computational cost by employing lower-precision, lower-bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. We introduce a multidimensional evaluation framework that qualitatively assesses specific capability dimensions, and we conduct quantitative analyses of the step-by-step outputs produced under various quantization methods. Our results demonstrate that quantization differentially affects numerical computation and reasoning planning abilities, identifying key areas where quantized models experience performance degradation.
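The step-by-step output comparison described in the abstract can be sketched as a diff between the full-precision and quantized models' reasoning traces: find the first step where the traces diverge and score their overall similarity. This is a minimal sketch of the idea; `step_divergence`, the example traces, and the similarity metric are hypothetical, not the paper's actual evaluation code.

```python
import difflib

def step_divergence(full_precision_steps, quantized_steps):
    """Return (index of first diverging step, overall trace similarity).

    Hypothetical metric: steps are compared verbatim, and similarity is
    difflib's ratio over the concatenated traces.
    """
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(full_precision_steps,
                                          quantized_steps)) if a != b),
        min(len(full_precision_steps), len(quantized_steps)),
    )
    similarity = difflib.SequenceMatcher(
        None,
        "\n".join(full_precision_steps),
        "\n".join(quantized_steps),
    ).ratio()
    return first_diff, similarity

fp = ["Let x be the unknown.", "Then 2x + 3 = 11.", "So x = 4."]
q  = ["Let x be the unknown.", "Then 2x + 3 = 11.", "So x = 7."]
print(step_divergence(fp, q))  # first divergence at step index 2
```

A capability-aware analysis would then classify each diverging step, e.g. as a numerical-computation error versus a planning error, to attribute degradation to a specific capability dimension.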