Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the instability of large language models (LLMs) when presented with inputs that are semantically equivalent but differ in lexical or syntactic form, a vulnerability that undermines their reliability in evaluation settings. For the first time, the authors integrate linguistic principles to construct perturbation datasets that preserve semantic equivalence, applying lexical substitutions (synonym replacement) and syntactic transformations (alterations to dependency structure). They systematically evaluate the robustness of 23 prominent LLMs on the MMLU, SQuAD, and AMEGA benchmarks, complemented by statistical significance testing. The findings reveal that lexical perturbations consistently and significantly degrade model performance, whereas syntactic perturbations yield variable effects. Notably, model scale shows no consistent correlation with robustness, suggesting that current LLMs rely excessively on surface-level patterns rather than deep semantic understanding, and challenging the prevailing paradigm of assessing model capabilities solely through raw benchmark scores.
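The perturbation pipelines are only described at a high level here. As a rough illustration of the lexical branch, the sketch below swaps content words for same-part-of-speech WordNet synonyms via NLTK; the function name, resource handling, and deterministic synonym choice are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a meaning-preserving lexical perturbation:
# replace content words with same-POS WordNet synonyms. This only
# approximates the paper's "synonym replacement" step.
import nltk
from nltk.corpus import wordnet as wn

# Resource names vary across NLTK versions; failed downloads are silent.
for pkg in ("punkt", "punkt_tab", "averaged_perceptron_tagger",
            "averaged_perceptron_tagger_eng", "wordnet"):
    nltk.download(pkg, quiet=True)

# Map Penn Treebank tag prefixes to WordNet POS constants.
PTB_TO_WN = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}

def lexical_perturb(sentence: str) -> str:
    """Replace each content word with an arbitrary same-POS synonym."""
    perturbed = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = PTB_TO_WN.get(tag[:2])
        candidates = set()
        if wn_pos is not None:
            for synset in wn.synsets(word, pos=wn_pos):
                for lemma in synset.lemmas():
                    name = lemma.name()
                    if "_" not in name and name.lower() != word.lower():
                        candidates.add(name)
        # Deterministic pick for reproducibility; a real pipeline would
        # filter candidates for contextual fit and meaning preservation.
        perturbed.append(min(candidates) if candidates else word)
    return " ".join(perturbed)

print(lexical_perturb("A rapid answer is rarely the correct answer."))
```

A real pipeline would additionally vet candidates in context so that substitutions remain truth-conditionally equivalent, which is the property the paper's datasets are built around.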

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness does not scale consistently with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
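The abstract states that the syntactic pipeline uses dependency parsing to decide which transformations apply to a given sentence. The sketch below shows what that decision step might look like with spaCy; the two rules (passivization, adverb fronting) are assumed examples, not the paper's actual transformation inventory.

```python
# Illustrative sketch of the dependency-parsing step: inspect a parse and
# list which meaning-preserving transformations could apply. The two rules
# below are assumed examples, not the paper's actual inventory.
import spacy

# Requires the model once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def applicable_transformations(sentence: str) -> list[str]:
    doc = nlp(sentence)
    found = []
    for token in doc:
        child_deps = {child.dep_ for child in token.children}
        # A verb governing both a nominal subject and a direct object can
        # usually be passivized without changing truth conditions.
        if token.pos_ == "VERB" and "nsubj" in child_deps and "dobj" in child_deps:
            found.append(f"passivization({token.text})")
        # A post-verbal adverb can often be fronted without changing meaning
        # ("approved the proposal quickly" -> "quickly approved the proposal").
        if (token.dep_ == "advmod" and token.head.pos_ == "VERB"
                and token.i > token.head.i):
            found.append(f"adverb_fronting({token.text})")
    return found

print(applicable_transformations("The committee approved the proposal quickly."))
# e.g. ['passivization(approved)', 'adverb_fronting(quickly)']
```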
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
evaluation robustness
lexical perturbation
syntactic perturbation
benchmark reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

lexical perturbation
syntactic transformation
truth-conditionally equivalent
LLM robustness
benchmark sensitivity
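Both the summary and the abstract mention statistical significance testing of the perturbation-induced score drops. Since original and perturbed prompts cover the same benchmark items, a paired test is the natural fit; the sketch below applies McNemar's exact test to per-item correctness on synthetic data, as the paper's exact test is not specified on this page.

```python
# Paired significance test on per-item correctness, original vs. perturbed.
# McNemar's exact test suits paired binary outcomes on the same items.
# All data below are synthetic; the paper's exact test is not specified here.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_items = 500
correct_orig = rng.random(n_items) < 0.75               # correctness on original prompts
degrade = correct_orig & (rng.random(n_items) < 0.12)   # perturbation breaks some items
improve = ~correct_orig & (rng.random(n_items) < 0.04)  # ...and fixes a few
correct_pert = (correct_orig & ~degrade) | improve

# 2x2 contingency table of paired outcomes (rows: original, cols: perturbed).
table = np.array([
    [np.sum(correct_orig & correct_pert),  np.sum(correct_orig & ~correct_pert)],
    [np.sum(~correct_orig & correct_pert), np.sum(~correct_orig & ~correct_pert)],
])
result = mcnemar(table, exact=True)
print(f"accuracy {correct_orig.mean():.3f} -> {correct_pert.mean():.3f}, "
      f"p = {result.pvalue:.2e}")
```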