More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a "directional framing bias" in large language models (LLMs) on numerical comparison tasks: logically equivalent prompts, e.g., phrased with "more," "less," or "equal," induce systematic prediction shifts. The authors introduce MathComp, a benchmark of 300 controlled comparison scenarios, to formally define and quantify this bias for the first time. Experiments spanning three model families and 14 controlled prompt variants reveal that incorporating demographic identity terms (e.g., "female," "Black") amplifies the bias by 37% on average. Chain-of-thought (CoT) prompting mitigates the effect, with free-form reasoning proving more robust than structured formats. The core contributions are threefold: (1) a formal definition and quantification of semantic framing as a directional perturbation to numerical reasoning; (2) empirical demonstration that linguistic structure interacts with social referents to exacerbate bias; and (3) a novel analytical dimension for assessing LLM reasoning robustness and fairness.
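The summary's central setup is a set of logically equivalent prompts that differ only in the comparative cue word. A minimal sketch of how such framings might be generated follows; the template wording and function name are illustrative assumptions, not the benchmark's actual format:

```python
def framings(a_name, a_val, b_name, b_val):
    """Return prompts asking the same comparison under different cue words.

    All three prompts describe identical quantities; only the framing
    term ("more", "less", "equal") varies.
    """
    facts = f"{a_name} has {a_val} and {b_name} has {b_val}."
    return {
        "more":  f"{facts} Does {a_name} have more than {b_name}?",
        "less":  f"{facts} Does {a_name} have less than {b_name}?",
        "equal": f"{facts} Do {a_name} and {b_name} have an equal amount?",
    }

prompts = framings("Alice", 12, "Bob", 9)
```

An unbiased model should answer all three variants consistently with the underlying quantities; the bias the paper reports is a systematic shift toward whichever cue word appears.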

📝 Abstract
Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words "more", "less", or "equal" systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering: systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., "a woman", "a Black person") in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.
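The abstract's key quantity is "directional drift": errors that move toward the framing term rather than away from it. One plausible way to operationalize this is sketched below; the metric definition and record format are our assumptions, not necessarily the paper's exact formulation, and `records` stands in for real model outputs:

```python
from collections import Counter

def directional_drift(records):
    """Estimate per-framing drift from (framing, prediction, truth) triples.

    For each framing cue, returns the fraction of erroneous answers that
    agree with the cue word, i.e., errors pointing in the cued direction.
    """
    toward = Counter()
    errors = Counter()
    for framing, pred, truth in records:
        if pred != truth:
            errors[framing] += 1
            if pred == framing:  # the error matches the cue's direction
                toward[framing] += 1
    return {f: toward[f] / errors[f] for f in errors}

# Toy records: ground truth is "equal" in every scenario.
records = [
    ("more", "more", "equal"),   # error toward the cue
    ("more", "less", "equal"),   # error away from the cue
    ("less", "less", "equal"),   # error toward the cue
]
print(directional_drift(records))  # {'more': 0.5, 'less': 1.0}
```

A drift fraction near 0.5 would indicate errors unrelated to the cue; values near 1.0, as the paper reports for some framings, indicate that errors are systematically steered by the prompt wording.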
Problem

Research questions and friction points this paper is trying to address.

LLMs exhibit directional bias in comparative reasoning tasks
Prompt phrasing systematically steers model predictions toward the framing term
Demographic terms amplify bias despite identical quantitative inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MathComp benchmark for comparative reasoning
Uses chain-of-thought prompting to reduce biases
Examines demographic impact on directional drift