🤖 AI Summary
Existing LLM benchmarks (e.g., MMLU, GPQA, WMDP-bio) suffer from performance saturation, diminishing their discriminative power for evaluating advanced models’ reasoning and recall capabilities.
Method: We propose "resurrecting" saturated benchmarks by raising their effective performance ceilings with two simple question perturbations: pairing questions together and augmenting the set of answer options. Because the perturbations reuse existing items, saturation becomes a controllable property of the benchmark rather than a fixed limitation, and model accuracy degrades predictably under them.
Contribution/Results: We show empirically that this degradation is consistent across multiple state-of-the-art models. Applied to three saturated benchmarks, the perturbations restore discriminative power without requiring re-annotation or model retraining, offering a low-cost, plug-and-play way to revitalize benchmarks and assess model robustness and capability boundaries.
📝 Abstract
Recent work has shown that small changes to benchmark questions can reduce LLMs' measured reasoning and recall. We explore two such changes, pairing questions together and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models these changes reduce performance predictably, effectively raising a benchmark's performance ceiling and unsaturating it again. We suggest this approach can resurrect old benchmarks.
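To make the two perturbations concrete, here is a minimal sketch of how question pairing and option augmentation could be applied to a multiple-choice item; the `MCQ` dataclass and helper names are illustrative assumptions, not the authors' actual evaluation code.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]   # candidate answers
    answer: int          # index of the correct option

def pair_questions(a: MCQ, b: MCQ) -> MCQ:
    """Combine two items into one: the model must answer both correctly.

    The paired item's options are all (option_a, option_b) combinations,
    so chance accuracy drops from 1/n to 1/(n_a * n_b).
    """
    options = [f"(1) {oa}  (2) {ob}" for oa in a.options for ob in b.options]
    answer = a.answer * len(b.options) + b.answer
    question = f"Answer both questions.\n1. {a.question}\n2. {b.question}"
    return MCQ(question, options, answer)

def augment_options(item: MCQ, pool: list[str], k: int = 4, seed: int = 0) -> MCQ:
    """Add k distractors drawn from other items' options (a simple stand-in
    for however distractors are actually sourced)."""
    rng = random.Random(seed)
    extra = rng.sample([o for o in pool if o not in item.options], k)
    # Appending keeps the correct index unchanged; shuffling the options
    # afterwards would require remapping `answer`.
    return MCQ(item.question, item.options + extra, item.answer)
```

In this sketch, both transforms leave the original question text and gold answers untouched, which is why no re-annotation would be needed.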