🤖 AI Summary
Large language models (LLMs) are easily misled by numerical noise: their performance degrades sharply on math word problems (MWPs) that contain irrelevant variables or distractors.
Method: We propose a systematic framework for constructing numerical-noise adversarial MWPs, introducing benchmark datasets with controlled injection of irrelevant variables: PROBLEMATHIC and GSM-8K-Adv. We further design a noise-robust training paradigm that combines adversarial sample generation with instruction fine-tuning of Llama-2 and Mistral.
Contribution/Results: Experiments show that base LLMs suffer an average relative performance drop of ~26% on adversarial MWPs; adversarial fine-tuning recovers ~8% accuracy, yet a performance drop of up to 6% persists on GSM-8K-Adv, exposing a remaining robustness gap under realistic interference. This work offers a principled interference-modeling paradigm for MWPs, providing both a rigorous evaluation benchmark and a methodological foundation for improving reasoning robustness in LLMs.
📝 Abstract
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
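To make the adversarial construction concrete, here is a minimal sketch of distractor injection: appending a sentence with a numerically plausible but irrelevant variable to an MWP. This is an illustrative toy with hypothetical templates, not the paper's actual prompting framework, which uses an LLM to generate the adversarial variants.

```python
import random

def add_irrelevant_variable(problem: str, seed: int = 0) -> str:
    """Append a distractor sentence introducing an irrelevant quantity.

    Illustrative only: names, entities, and templates are hypothetical,
    standing in for the paper's LLM-generated adversarial rewrites.
    """
    rng = random.Random(seed)  # seeded for reproducible variants
    name = rng.choice(["Sam", "Riya", "Chen"])
    count = rng.randint(2, 50)
    entity = rng.choice(["stickers", "marbles", "coins", "pencils"])
    # The distractor mentions a number unrelated to the question asked.
    distractor = f"{name} also has {count} {entity}."
    return problem.rstrip() + " " + distractor

mwp = "Alice has 3 apples and buys 4 more. How many apples does she have?"
adversarial_mwp = add_irrelevant_variable(mwp)
```

A robust solver should return the same answer (7) for both `mwp` and `adversarial_mwp`; the ~26% drop reported above reflects how often models instead fold the distractor number into their reasoning.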