Resurrecting saturated LLM benchmarks with adversarial encoding

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM benchmarks (e.g., MMLU, GPQA, WMDP-bio) suffer from performance saturation, which diminishes their power to discriminate among advanced models' reasoning and recall capabilities. Method: We propose a "benchmark resurrection" paradigm that raises a benchmark's performance ceiling via adversarial question perturbations (question pairing and option augmentation), treating saturation as a controllable evaluation dimension. Using adversarial prompt engineering and multi-benchmark attribution analysis, we characterize how model accuracy degrades under these subtle perturbations. Contribution/Results: We show empirically that the degradation is predictable and consistent across multiple SOTA models. Applied to three saturated benchmarks, the method restores substantial discriminative ability without re-annotation or model retraining, offering a low-cost, plug-and-play route to benchmark revitalization and to fine-grained assessment of model robustness and capability boundaries.

📝 Abstract
Recent work showed that small changes in benchmark questions can reduce LLMs' reasoning and recall performance. We explore two such changes, pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models these changes predictably reduce performance, essentially raising the performance ceiling of a benchmark and unsaturating it again. We suggest this approach can resurrect old benchmarks.
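The abstract's two perturbations can be sketched concretely. The encoding below (concatenated stems, a Cartesian product of the option sets, and appended distractors) is an illustrative assumption, not necessarily the paper's exact format:

```python
import itertools

def pair_questions(q1, q2):
    """Combine two multiple-choice questions into one compound question.

    Illustrative sketch: the joint question is answered correctly only if
    both sub-answers are right, so the compound option set is the
    Cartesian product of the two original option sets.
    """
    stem = f"Q1: {q1['stem']}\nQ2: {q2['stem']}\nAnswer both."
    options = [f"{a} / {b}"
               for a, b in itertools.product(q1["options"], q2["options"])]
    # The correct compound option pairs the two correct sub-answers.
    correct = f"{q1['options'][q1['answer']]} / {q2['options'][q2['answer']]}"
    return {"stem": stem, "options": options, "answer": options.index(correct)}

def augment_options(q, extra_distractors):
    """Append extra plausible distractors, lowering the random-guess floor."""
    return {"stem": q["stem"],
            "options": q["options"] + list(extra_distractors),
            "answer": q["answer"]}
```

Neither transformation needs new annotations: the compound gold answer and the augmented gold answer are derived mechanically from the existing labels, which is what makes the approach plug-and-play.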
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM benchmark difficulty
Evaluating LLM reasoning and recall
Reviving outdated benchmark effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial encoding enhances benchmarks
Pairing questions increases difficulty
Adding options unsaturates benchmarks
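As rough intuition for the bullets above: with n answer options the random-guess accuracy floor is 1/n, and pairing two questions multiplies their option counts, so both perturbations widen the score range below a saturated model's current accuracy. A minimal illustration (the option counts are assumed examples, not figures from the paper):

```python
def guess_floor(n_options: int) -> float:
    """Expected accuracy of uniform random guessing on n options."""
    return 1.0 / n_options

# A standard 4-option question, the same question augmented to 10 options,
# and a pair of 4-option questions scored jointly (4 * 4 = 16 compound options).
print(guess_floor(4))      # 0.25
print(guess_floor(10))     # 0.1
print(guess_floor(4 * 4))  # 0.0625
```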