Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

πŸ“… 2025-02-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
High scores of large language models (LLMs) on public benchmarks (e.g., MMLU) may reflect overfitting to superficial dataset cues or prompt patterns rather than genuine linguistic understanding. Method: The authors propose C-BOD, a meta-evaluation framework that applies semantics-preserving parametric prompt rewriting, coupled with statistical significance testing and cross-model sensitivity analysis, to systematically assess LLM robustness against prompt perturbations across 26 mainstream models. Contribution/Results: Experiments reveal an average performance drop of 2.15%, with degradation statistically significant in 20 models. Larger-parameter and higher-accuracy models exhibit greater vulnerability, whereas Llama-series and low-baseline models remain robust. This work establishes a paradigm for detecting overfitting via semantics-invariant perturbations, challenging the validity of current leaderboards and enabling dataset- and model-agnostic robustness monitoring.
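The detection recipe summarized above (re-evaluate each item on a meaning-preserving rephrasing, measure the accuracy drop, and test per-item outcome flips for significance) can be sketched as follows. This is a minimal illustration, not the authors' code: the significance test shown here is an exact McNemar-style binomial test on discordant items, and the toy correctness vectors stand in for real model outputs.

```python
from math import comb

def mcnemar_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs:
    b = items correct on the original prompt but wrong after rephrasing,
    c = items wrong on the original but correct after rephrasing."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial tail with p = 0.5 under the null
    # hypothesis that flips in either direction are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def overfit_report(orig_correct, rephrased_correct):
    """Compare per-item correctness before/after semantics-preserving
    rephrasing; return the accuracy drop and a significance p-value."""
    assert len(orig_correct) == len(rephrased_correct)
    pairs = list(zip(orig_correct, rephrased_correct))
    b = sum(o and not r for o, r in pairs)
    c = sum(not o and r for o, r in pairs)
    n = len(pairs)
    drop = (sum(orig_correct) - sum(rephrased_correct)) / n
    return {"accuracy_drop": drop, "p_value": mcnemar_p(b, c)}

# Toy example: 100 items, 20 flip from correct to wrong after rephrasing.
orig = [True] * 90 + [False] * 10
reph = [True] * 70 + [False] * 30
print(overfit_report(orig, reph))  # 20% drop, p far below 0.05
```

A model whose correct answers survive rephrasing yields few discordant pairs and a large p-value; a model that memorized surface patterns shows a significant one-sided flip, which is the overfitting signal the framework reports.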

πŸ“ Abstract
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings, indicating that both may overrely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.
Problem

Research questions and friction points this paper is trying to address.

Detecting LLM overfitting to benchmark datasets.
Evaluating model robustness under semantics-preserving transformations.
Promoting resilience and generalization in LLM evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametric prompt transformation detects benchmark overfitting
Rephrasing inputs while preserving labels exposes memorized patterns
Dataset- and model-agnostic design integrates into training pipelines
πŸ”Ž Similar Papers
No similar papers found.