Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) lack proactive question-asking capabilities when confronting incomplete mathematical problems—a critical gap in current evaluation frameworks. Method: We construct the first multi-scenario dataset of incomplete mathematical problems and propose a novel three-dimensional evaluation paradigm centered on “proactivity,” “over-reasoning,” and “hallucination.” Through qualitative and quantitative analyses, we systematically characterize LRM behaviors, revealing widespread avoidance of questioning, tendencies toward over-reasoning, and hallucinatory answer generation. We further investigate supervised fine-tuning (SFT) for enhancing question-asking ability, finding marginal improvements but no fundamental resolution of proactivity deficiency. Contribution/Results: This work pioneers the integration of proactive information seeking into LRM evaluation, establishing a theoretical foundation, standardized benchmark, and actionable improvement pathways toward AI agents with authentic interactive intelligence.
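The three evaluation dimensions described above can be illustrated with a toy scorer. This is a hypothetical sketch, not the paper's actual implementation: the function names, keyword rules, and token budget are invented placeholders, and a real evaluation would rely on an LLM judge or human annotation rather than regex heuristics.

```python
import re

def classify_response(response: str, reasoning_tokens: int,
                      token_budget: int = 2000) -> dict:
    """Crude, illustrative scoring of one LRM response to an incomplete
    math problem along three dimensions: proactivity (does it ask for the
    missing information?), over-reasoning (does it burn excessive reasoning
    tokens?), and hallucination (does it assert an answer anyway?).
    The keyword patterns and the token budget are assumptions for this sketch.
    """
    # Proactivity: the response asks a clarifying question or flags missing data.
    asks_question = bool(
        re.search(r"\b(could you (provide|clarify)|missing|unspecified)\b",
                  response, re.IGNORECASE)
        or "?" in response
    )
    # Hallucination proxy: a final numeric answer despite the problem
    # being unanswerable as stated.
    gives_answer = bool(re.search(r"=\s*-?\d+(\.\d+)?", response))
    return {
        "proactive": asks_question,
        "over_reasoning": reasoning_tokens > token_budget,
        "hallucination": gives_answer and not asks_question,
    }

# Example: a proactive response vs. a hallucinated one.
r1 = classify_response(
    "The travel speed is missing. Could you clarify it?", reasoning_tokens=300)
r2 = classify_response(
    "After long deliberation, the answer = 42.", reasoning_tokens=5000)
```

Under this sketch, `r1` is scored proactive and `r2` is scored as both over-reasoning and hallucinating, mirroring the failure modes the summary attributes to current LRMs.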

📝 Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, but existing benchmarks evaluate them exclusively on well-defined problems. This evaluation setup leaves a critical gap: a genuinely intelligent agent should not only solve problems (as a math quiz solver) but also ask for information when a problem lacks sufficient detail, responding proactively to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on this dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning for learning this ability. We hope to provide new insights toward developing LRMs with genuine intelligence, rather than mere problem solvers.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LRMs' ability to request missing information in problems
Assessing LRMs' limitations in proactive information-seeking behaviors
Analyzing overthinking and hallucination issues in Large Reasoning Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

New dataset for incomplete problem evaluation
Systematic evaluation of LRMs' proactive questioning
Supervised fine-tuning to improve proactive question-asking
Youcheng Huang
Sichuan University, China; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China
Bowen Qin
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Chen Huang
Sichuan University, China; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China; Institute of Data Science, National University of Singapore, Singapore
Duanyu Feng
Sichuan University
Machine learning · Numerical optimization · Natural language processing
Xi Yang
Beijing Academy of Artificial Intelligence
Wenqiang Lei
Sichuan University, China; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China