🤖 AI Summary
Existing mathematical word problem (MWP) benchmarks lack realistic distractors; the few datasets that do include them suffer from low difficulty, semantically implausible constructions, and easy detectability by models, undermining evaluation validity. Moreover, manually crafting distractors requires rewriting solution derivations, incurring high annotation costs. This paper proposes an iterative, large language model (LLM)-based framework for generating semantically coherent, hard-to-detect distractors that preserve the original problem's solution path and ground-truth answer. The approach employs multi-round, cognition-guided prompting to iteratively refine distractor generation. Key contributions include: (1) eliminating the need to rewrite answers or solutions, thereby drastically reducing human verification effort; and (2) enabling scalable construction of high-quality, challenging distractor-augmented MWPs. Experiments demonstrate that models fine-tuned on the augmented data exhibit significantly improved robustness to irrelevant information. Consequently, the method establishes a more reliable benchmark for assessing mathematical reasoning capabilities.
📝 Abstract
Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the information necessary to solve each problem, while problems with distracting or superfluous conditions are often overlooked. Prior studies have shown that popular LLMs suffer a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and contextually unnatural phrasing. These shortcomings make the distracting conditions easy to detect and disregard, reducing the credibility of benchmarks built on these datasets. Moreover, adding distracting conditions may change the reasoning process and the answer, requiring intensive manual effort to check and rewrite solutions.
To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality.
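The iterative generate-and-verify loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `call_llm` is a hypothetical stand-in for any chat-completion API (stubbed here with canned responses so the control flow runs end to end), and the prompt wording, round count, and `KEEP` verdict convention are all assumptions for the sketch.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client.
    Stubbed with canned responses so the loop below is runnable."""
    if prompt.startswith("REVISE"):
        # Stub distractor: irrelevant to the arithmetic in the problem.
        return "On the way to the store, Tom also saw 3 birds."
    # Stub verifier verdict: the added condition leaves the solution intact.
    return "KEEP"

def add_distractor(problem: str, solution: str, max_rounds: int = 3) -> str:
    """Iteratively ask the model for a distracting condition that must not
    alter the original solution; retry in the next round if it does."""
    for _ in range(max_rounds):
        distractor = call_llm(
            "REVISE: add one meaningful distracting condition to this "
            "problem without changing its solution.\n"
            f"Problem: {problem}\nSolution (must remain valid): {solution}"
        )
        candidate = f"{problem} {distractor}"
        verdict = call_llm(
            "VERIFY: does the added condition change the solution?\n"
            f"Problem: {candidate}\nSolution: {solution}"
        )
        if verdict == "KEEP":
            # Solution preserved: accept the revised problem, no new
            # answer needs to be produced or checked by hand.
            return candidate
    return problem  # fall back to the original if no round succeeds

revised = add_distractor(
    "Tom buys 4 apples at $2 each. How much does he spend?",
    "4 * 2 = 8",
)
print(revised)
```

Because the verifier constrains every accepted revision to share the original solution, the original answer key carries over unchanged, which is what removes the manual rewriting cost the abstract highlights.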