Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) lack proactive question-asking capabilities when confronting incomplete mathematical problems—a critical gap in current evaluation frameworks. Method: We construct the first multi-scenario dataset of incomplete mathematical problems and propose a novel three-dimensional evaluation paradigm centered on “proactivity,” “over-reasoning,” and “hallucination.” Through qualitative and quantitative analyses, we systematically characterize LRM behaviors, revealing widespread avoidance of questioning, tendencies toward over-reasoning, and hallucinatory answer generation. We further investigate supervised fine-tuning (SFT) for enhancing question-asking ability, finding marginal improvements but no fundamental resolution of proactivity deficiency. Contribution/Results: This work pioneers the integration of proactive information seeking into LRM evaluation, establishing a theoretical foundation, standardized benchmark, and actionable improvement pathways toward AI agents with authentic interactive intelligence.
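The three evaluation dimensions described above can be illustrated with a toy scorer. This is a hypothetical sketch, not the paper's actual implementation: the function names, keyword rules, and token budget are invented placeholders, and a real evaluation would rely on an LLM judge or human annotation rather than regex heuristics.

```python
import re

def classify_response(response: str, reasoning_tokens: int,
                      token_budget: int = 2000) -> dict:
    """Crude, illustrative scoring of one LRM response to an incomplete
    math problem along three dimensions: proactivity (does it ask for the
    missing information?), over-reasoning (does it burn excessive reasoning
    tokens?), and hallucination (does it assert an answer anyway?).
    The keyword patterns and the token budget are assumptions for this sketch.
    """
    # Proactivity: the response asks a clarifying question or flags missing data.
    asks_question = bool(
        re.search(r"\b(could you (provide|clarify)|missing|unspecified)\b",
                  response, re.IGNORECASE)
        or "?" in response
    )
    # Hallucination proxy: a final numeric answer despite the problem
    # being unanswerable as stated.
    gives_answer = bool(re.search(r"=\s*-?\d+(\.\d+)?", response))
    return {
        "proactive": asks_question,
        "over_reasoning": reasoning_tokens > token_budget,
        "hallucination": gives_answer and not asks_question,
    }

# Example: a proactive response vs. a hallucinated one.
r1 = classify_response(
    "The travel speed is missing. Could you clarify it?", reasoning_tokens=300)
r2 = classify_response(
    "After long deliberation, the answer = 42.", reasoning_tokens=5000)
```

Under this sketch, `r1` is scored proactive and `r2` is scored as both over-reasoning and hallucinating, mirroring the failure modes the summary attributes to current LRMs.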

📝 Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, but existing benchmarks evaluate them exclusively on well-defined problems. This evaluation setup leaves a critical gap: a genuinely intelligent agent should not only solve problems (as a math quiz solver) but also ask for information when a problem lacks sufficient detail, responding proactively to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on this dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning for learning this ability. We hope to provide new insights toward developing LRMs with genuine intelligence, rather than mere problem solvers.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LRMs' ability to request missing information in problems
Assessing LRMs' limitations in proactive information-seeking behaviors
Analyzing overthinking and hallucination issues in Large Reasoning Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

New dataset for incomplete problem evaluation
Systematic evaluation of LRMs' proactive questioning
Supervised fine-tuning to improve proactive question-asking
Youcheng Huang
Sichuan University, China; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China
Bowen Qin
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Chen Huang
Sichuan University, China; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China; Institute of Data Science, National University of Singapore, Singapore
Duanyu Feng
Sichuan University
Machine learning · Numerical optimization · Natural language processing
Xi Yang
Beijing Academy of Artificial Intelligence
Wenqiang Lei
Sichuan University, China; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China