🤖 AI Summary
Existing LLM benchmarks (e.g., MMLU, GPQA, WMDP-bio) suffer from performance saturation, diminishing their discriminative power for evaluating advanced models’ reasoning and recall capabilities.
Method: We propose "resurrecting" saturated benchmarks by raising their effective performance ceilings with two simple question perturbations: pairing questions together and augmenting the set of answer options. Because the perturbations reuse existing items, saturation becomes a controllable property of the benchmark rather than a fixed limitation, and model accuracy degrades predictably under them.
Contribution/Results: We show empirically that this degradation is consistent across multiple state-of-the-art models. Applied to three saturated benchmarks, the perturbations restore discriminative power without requiring re-annotation or model retraining, offering a low-cost, plug-and-play way to revitalize benchmarks and assess model robustness and capability boundaries.
📝 Abstract
Recent work has shown that small changes to benchmark questions can reduce LLMs' measured reasoning and recall. We explore two such changes, pairing questions together and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models these changes reduce performance predictably, effectively raising a benchmark's performance ceiling and unsaturating it again. We suggest this approach can resurrect old benchmarks.
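To make the two perturbations concrete, here is a minimal sketch of how question pairing and option augmentation could be applied to a multiple-choice item; the `MCQ` dataclass and helper names are illustrative assumptions, not the authors' actual evaluation code.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]   # candidate answers
    answer: int          # index of the correct option

def pair_questions(a: MCQ, b: MCQ) -> MCQ:
    """Combine two items into one: the model must answer both correctly.

    The paired item's options are all (option_a, option_b) combinations,
    so chance accuracy drops from 1/n to 1/(n_a * n_b).
    """
    options = [f"(1) {oa}  (2) {ob}" for oa in a.options for ob in b.options]
    answer = a.answer * len(b.options) + b.answer
    question = f"Answer both questions.\n1. {a.question}\n2. {b.question}"
    return MCQ(question, options, answer)

def augment_options(item: MCQ, pool: list[str], k: int = 4, seed: int = 0) -> MCQ:
    """Add k distractors drawn from other items' options (a simple stand-in
    for however distractors are actually sourced)."""
    rng = random.Random(seed)
    extra = rng.sample([o for o in pool if o not in item.options], k)
    # Appending keeps the correct index unchanged; shuffling the options
    # afterwards would require remapping `answer`.
    return MCQ(item.question, item.options + extra, item.answer)
```

In this sketch, both transforms leave the original question text and gold answers untouched, which is why no re-annotation would be needed.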