Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies and systematically validates "self-bias" in large language models (LLMs) acting as benchmark generators: the inflated performance estimates a model receives on evaluation sets it generated itself. The bias is attributed to three interrelated sub-biases: domain mismatch, stylistic inconsistency, and erroneous labeling. To address this, the authors give the first formal definition of self-bias and propose Silencer, a general-purpose debiasing framework. Silencer combines multi-LLM collaborative generation, heterogeneity modeling at the sample and benchmark levels, bias origin tracing, and adaptive filtering, requiring neither human annotation nor model fine-tuning. Experiments demonstrate near-zero residual self-bias, with evaluation validity, measured by Pearson correlation against human-constructed benchmarks, improving from 0.655 to 0.833 on average. Moreover, Silencer exhibits strong cross-model and cross-task generalization.
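A minimal sketch of the multi-generator idea described above, assuming a setup where each LLM both generates a benchmark and is evaluated. This is not the paper's reference implementation: the names (generate_benchmark, evaluate, GENERATORS) and the 0.5 agreement threshold are illustrative assumptions, and the LLM calls are replaced by toy stubs so the script runs end to end.

```python
# Hedged sketch of multi-generator heterogeneity for self-bias mitigation.
# All names and thresholds are assumptions; LLM calls are toy stand-ins.
import random
from statistics import mean

GENERATORS = ["gen_a", "gen_b", "gen_c"]   # benchmark-generating LLMs
MODELS = GENERATORS                        # models under evaluation (may overlap)

def generate_benchmark(generator: str, n: int = 50) -> list[dict]:
    """Stand-in for prompting `generator` to write question/label pairs."""
    return [{"question": f"{generator}-q{i}", "source": generator} for i in range(n)]

def evaluate(model: str, samples: list[dict]) -> float:
    """Stand-in for scoring `model`; deliberately favors self-generated samples
    to mimic self-bias."""
    if not samples:
        return 0.0
    hits = sum(random.random() < (0.9 if s["source"] == model else 0.7) for s in samples)
    return hits / len(samples)

# 1. Each generator produces its own candidate benchmark.
benchmarks = {g: generate_benchmark(g) for g in GENERATORS}

# 2. Sample-level heterogeneity: keep a sample only if models *other than its
#    generator* also handle it, filtering out samples that mainly "work" for
#    their creator (one symptom of self-bias).
def is_neutral(sample: dict) -> bool:
    others = [m for m in MODELS if m != sample["source"]]
    return mean(evaluate(m, [sample]) for m in others) >= 0.5  # assumed threshold

pooled = [s for samples in benchmarks.values() for s in samples if is_neutral(s)]

# 3. Benchmark-level heterogeneity: score every model on the pooled multi-source
#    benchmark, so no model is evaluated predominantly on its own generations.
print({m: round(evaluate(m, pooled), 3) for m in MODELS})
```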

📝 Abstract
LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmarks. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero and significantly improve the evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with a high-quality human-annotated benchmark), while also exhibiting strong generalizability.
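For concreteness, one way to write down the two quantities the abstract refers to. This notation is an assumption used only for illustration, not the paper's own formalism: s(M_i, B_j) denotes the score model M_i obtains on the benchmark B_j produced by generator M_j, and B_human is a high-quality human-annotated benchmark.

```latex
% Hedged sketch, not the paper's exact definitions.
% Self-bias of model M_i: its score on its own benchmark minus its average
% score on benchmarks produced by the other n-1 generators.
\[
  \mathrm{SelfBias}(M_i) \;=\; s(M_i, B_i) \;-\; \frac{1}{n-1}\sum_{j \neq i} s(M_i, B_j)
\]
% Evaluation effectiveness: Pearson correlation between the models' scores on
% the generated benchmark and their scores on the human-annotated benchmark.
\[
  \rho \;=\; \frac{\operatorname{cov}\bigl(s(\cdot, B_{\mathrm{gen}}),\, s(\cdot, B_{\mathrm{human}})\bigr)}
                  {\sigma_{s(\cdot, B_{\mathrm{gen}})}\,\sigma_{s(\cdot, B_{\mathrm{human}})}}
\]
```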
Problem

Research questions and friction points this paper is trying to address.

Identifying self-bias in LLM-generated benchmarks
Analyzing sub-biases from domain, style, and labels
Mitigating bias via a multi-generator heterogeneity framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multiple generators' heterogeneity
Neutralizes bias at sample and benchmark levels
Suppresses self-bias to near zero