When Benchmarks Leak: Inference-Time Decontamination for LLMs

📅 2026-01-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the pervasive issue of test set contamination in large language model (LLM) evaluation, which inflates performance metrics and undermines the reliability of benchmark comparisons. The authors propose an inference-time debiasing method that applies bounded perturbations in the input embedding space to suppress a model's reliance on memorization-based shortcuts. Guided by a reference model, the approach adaptively generates instance-specific perturbation directions without altering the original evaluation dataset. Extensive experiments across multiple open-weight LLMs and standard benchmarks show that the method mitigates contamination-driven performance inflation while largely preserving accuracy on clean, uncontaminated samples, striking a practical balance between decontamination effectiveness and real-world usability.

πŸ“ Abstract
Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and lead to noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
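The core mechanism the abstract describes, small norm-bounded perturbations added to input embeddings at evaluation time, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function name, the L2-ball projection, and the `eps` value are illustrative assumptions, and the perturbation direction here is random, whereas DeconIEP learns it with an instance-adaptive generator guided by a reference model.

```python
import numpy as np

def bounded_perturbation(embeddings, direction, eps=0.05):
    """Project a per-token perturbation direction onto an L2 ball of
    radius eps, then add it to the input embeddings.

    The bound guarantees the evaluated model sees inputs that stay
    close to the originals, which is how a method like this can avoid
    degrading behavior on clean samples.
    """
    # Per-token L2 norms of the raw direction, shape (tokens, 1).
    norms = np.linalg.norm(direction, axis=-1, keepdims=True)
    # Shrink any token's perturbation whose norm exceeds eps;
    # leave smaller ones untouched (scale factor capped at 1).
    scale = np.minimum(1.0, eps / np.maximum(norms, 1e-12))
    delta = direction * scale
    return embeddings + delta, delta

# Toy example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# Hypothetical perturbation direction; in the paper this would come
# from a learned, instance-specific generator, not random noise.
direction = rng.normal(size=(4, 8))
perturbed, delta = bounded_perturbation(emb, direction, eps=0.05)
# Every token's perturbation respects the bound (up to float roundoff).
print(bool(np.all(np.linalg.norm(delta, axis=-1) <= 0.05 + 1e-9)))  # True
```

In a real evaluation loop, `perturbed` would replace the embedding-layer output fed to the contaminated model, leaving the benchmark text itself untouched, which is the property that distinguishes this line of work from dataset-filtering approaches.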
Problem

Research questions and friction points this paper is trying to address.

test set contamination
benchmark leakage
large language models
evaluation reliability
memorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

decontamination
inference-time perturbation
test set contamination
large language models
embedding space