🤖 AI Summary
This work reveals that knowledge distillation can be maliciously exploited for “data laundering”: a stealthy evaluation vulnerability in which benchmark-specific knowledge is covertly injected into a model. The proposed three-stage attack comprises (1) knowledge distillation from a teacher model, (2) isolated training on target evaluation data, and (3) gradient-based hidden injection, elevating benchmark scores without improving genuine reasoning capability. We formally define “data laundering,” drawing an analogy to financial money laundering to expose a fundamental flaw in current AI evaluation: the lack of oversight over knowledge-transfer pathways. Experiments with a two-layer BERT student model demonstrate benchmark accuracy of up to 75% on GPQA with no corresponding improvement in generalization. These findings underscore the urgent need for robust evaluation frameworks and provide both technical warnings and methodological foundations for mitigating implicit knowledge leakage.
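The distillation step at the heart of the attack can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: a single linear layer plays the role of the 2-layer BERT student, the teacher is a random linear map, and the temperature and learning rate are arbitrary. The point is only the mechanism by which benchmark-derived knowledge flows through soft targets: the student is trained to match the teacher's temperature-softened output distribution via soft cross-entropy, so it inherits the teacher's behavior without ever seeing labels directly.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Toy setup: 64 examples, 8 features, 4 answer choices (all hypothetical).
X = rng.normal(size=(64, 8))
W_teacher = rng.normal(size=(8, 4))        # stands in for a benchmark-trained teacher
teacher_logits = X @ W_teacher

W_student = np.zeros((8, 4))               # stands in for the small student model
T, lr = 2.0, 0.5

def distill_loss(W):
    """Soft cross-entropy between teacher and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(X @ W, T)
    return -(p * np.log(q + 1e-12)).sum(axis=1).mean()

losses = []
for _ in range(200):
    p = softmax(teacher_logits, T)
    q = softmax(X @ W_student, T)
    # Gradient of the soft cross-entropy w.r.t. student logits: (q - p) / (T * N).
    grad_logits = (q - p) / (T * len(X))
    W_student -= lr * (X.T @ grad_logits)  # gradient descent on the student weights
    losses.append(distill_loss(W_student))
```

After training, the student mimics the teacher's soft predictions; if the teacher was fit on benchmark data, that knowledge has been transferred through a seemingly legitimate distillation step, which is exactly the laundering pathway the paper warns about.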
📝 Abstract
In this paper, we show that knowledge distillation can be subverted to manipulate language model benchmark scores, revealing a critical vulnerability in current evaluation practices. We introduce "Data Laundering," a three-phase process analogous to financial money laundering that enables the covert transfer of benchmark-specific knowledge through seemingly legitimate intermediate training steps. Through extensive experiments with a 2-layer BERT student model, we show how this approach can achieve substantial improvements in benchmark accuracy (up to 75% on GPQA) without developing genuine reasoning capabilities. Notably, this method can be exploited intentionally or even unintentionally: researchers may adopt knowledge-distillation practices that inflate scores without realizing the implications. While our findings demonstrate the effectiveness of this technique, we present them as a cautionary tale highlighting the urgent need for more robust evaluation methods in AI. This work aims to contribute to the ongoing discussion about evaluation integrity in AI development and the need for benchmarks that more accurately reflect true model capabilities. The code is available at https://github.com/mbzuai-nlp/data_laundering.