Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

📅 2024-05-30
📈 Citations: 5
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack robustness on math word problems (MWPs) that contain irrelevant variables, and are often misled by numerical noise. Method: The authors propose a prompting framework that systematically constructs adversarial MWPs by injecting irrelevant variables, yielding two datasets with controlled noise: PROBLEMATHIC and GSM-8K-Adv. They further fine-tune LLMs (Llama-2, Mistral) on the adversarial samples. Contribution/Results: Base LLMs suffer an average relative performance drop of ~26% on adversarial MWPs; adversarial fine-tuning recovers ~8% accuracy, yet models still lose up to 6% on GSM-8K-Adv, exposing a persistent robustness gap under realistic interference. The work provides both a controlled evaluation benchmark and a training recipe for improving reasoning robustness in LLMs.
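The headline numbers above are relative drops. A minimal sketch of that metric, with illustrative accuracies that are not taken from the paper:

```python
def relative_drop(clean_acc: float, adv_acc: float) -> float:
    """Relative performance drop (%) when moving from clean to
    adversarial accuracy; the ~26% figure reported in the paper is
    an average of this quantity, not an absolute-points drop."""
    return 100.0 * (clean_acc - adv_acc) / clean_acc

# Illustrative figures (not from the paper): a model at 60% clean
# accuracy that falls to 44.4% on adversarial MWPs loses 26% relatively,
# even though the absolute drop is only 15.6 points.
drop = relative_drop(60.0, 44.4)
```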

📝 Abstract
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
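The core idea of the adversarial construction can be sketched as below. Note the paper's framework prompts an LLM to generate context-aware distractors; the function name and the static templates here are simplified assumptions for illustration only.

```python
import random

def add_irrelevant_variable(context: str, question: str, seed: int = 0) -> str:
    """Toy sketch: perturb a math word problem by injecting a sentence
    that introduces a numerically plausible but irrelevant variable."""
    rng = random.Random(seed)
    # Hypothetical distractor templates; the actual framework generates
    # these adaptively via prompting.
    templates = [
        "A nearby shelf holds {n} unrelated boxes.",
        "The store next door sold {n} pencils that day.",
        "Earlier that week, {n} visitors walked past.",
    ]
    # Insert the distractor between the narrative and the question,
    # so the irrelevant number sits among the relevant ones.
    distractor = rng.choice(templates).format(n=rng.randint(2, 99))
    return f"{context} {distractor} {question}"

context = "Sam has 5 apples and buys 3 more."
question = "How many apples does Sam have now?"
adversarial = add_irrelevant_variable(context, question)
```

A robust solver should answer the adversarial variant identically to the clean one; the paper's experiments measure how far current LLMs fall short of that.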
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with math problems containing irrelevant information
How can adversarial MWPs with numerical noise be constructed systematically?
Does fine-tuning on adversarial samples make LLMs robust to noisy data?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposing prompting framework for adversarial MWPs
Fine-tuning LLMs on adversarial training samples
Introducing GSM-8K-Adv for generalizability assessment
Ujjwala Anantheswaran
Arizona State University
Himanshu Gupta
Arizona State University
Kevin Scaria
Arizona State University
Shreyas Verma
Georgia Institute of Technology
Chitta Baral
Professor of Computer Science, Arizona State University
Knowledge Representation, NLP, Vision, Robotics, Integrated Systems
Swaroop Mishra
Research Scientist, Google DeepMind
Large Language Models, Natural Language Processing