🤖 AI Summary
Large language models (LLMs) exhibit insufficient reasoning robustness, often relying on memorized patterns rather than systematic logical inference. This work identifies four fundamental bottlenecks: positional bias, instruction sensitivity, numerical fragility, and memory dependence. Method: We introduce Math-RoB, the first benchmark explicitly designed to evaluate reasoning robustness. It employs instruction-driven synthesis of diverse mathematical problems with intentional information omissions, augmented via multi-dimensional perturbations (positional shuffling, instruction variation, numeric substitution, and critical-information masking) to systematically induce hallucinations and expose reasoning flaws. Contribution/Results: Evaluated on state-of-the-art models including GPT-4o, Qwen2.5, and DeepSeek-V3, Math-RoB reveals up to 15 percentage points of performance degradation under perturbation, effectively distinguishing genuine reasoning capability from spurious memorization-based fitting.
📝 Abstract
Despite the recent success of large language models (LLMs) in reasoning, exemplified by DeepSeek, we identify for the first time a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete data, suggesting a reliance on memorized patterns rather than systematic reasoning. Our closer examination reveals four key unique limitations underlying this issue: (1) Positional bias: models favor earlier queries in multi-query inputs but answer later ones incorrectly (e.g., GPT-4o's accuracy drops from 75.8 percent to 72.8 percent); (2) Instruction sensitivity: performance declines by 5.0 to 7.5 percent in the Qwen2.5 series and by 5.0 percent in DeepSeek-V3 when auxiliary guidance is added; (3) Numerical fragility: value substitution sharply reduces accuracy (e.g., GPT-4o drops from 97.5 percent to 82.5 percent, and GPT-o1-mini from 97.5 percent to 92.5 percent); and (4) Memory dependence: models resort to guesswork when critical data are missing. These findings highlight a reliance on heuristic recall over rigorous logical inference, demonstrating challenges in reasoning robustness. To investigate these robustness challenges comprehensively, this paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. It uses an instruction-based approach to generate diverse datasets that closely resemble training distributions, facilitating a holistic robustness assessment and advancing the development of more robust reasoning frameworks.
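The perturbation families described above (positional shuffling, numeric substitution, and critical-information masking) can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the token-level number substitution, and the `[MASKED]` placeholder are all illustrative assumptions about how such probes might be built.

```python
import random

def substitute_numbers(problem: str, rng: random.Random) -> str:
    """Numerical-fragility probe (hypothetical): replace each integer
    token with a fresh random value while keeping the problem's
    structure intact, so only the arithmetic changes."""
    return " ".join(
        str(rng.randint(2, 99)) if tok.isdigit() else tok
        for tok in problem.split()
    )

def shuffle_positions(queries: list[str], rng: random.Random) -> list[str]:
    """Positional-bias probe (hypothetical): reorder the queries in a
    multi-query prompt to test whether accuracy depends on position."""
    order = queries[:]
    rng.shuffle(order)
    return order

def mask_critical_info(problem: str, critical_value: str) -> str:
    """Memory-dependence probe (hypothetical): mask a value the answer
    requires. A model that truly reasons should flag the missing
    information; one relying on recall tends to guess anyway."""
    return problem.replace(critical_value, "[MASKED]")

if __name__ == "__main__":
    rng = random.Random(0)
    p = "A train travels 120 km in 2 hours . What is its speed ?"
    print(substitute_numbers(p, rng))
    print(mask_critical_info(p, "120"))
    print(shuffle_positions(["Q1: ...", "Q2: ...", "Q3: ..."], rng))
```

Each probe leaves the problem's surface form close to the training distribution, which is what lets the benchmark separate memorized pattern-matching from genuine inference.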