🤖 AI Summary
Widespread label noise (10–25%) in NLP benchmark datasets leads to systematic underestimation of model performance, with many purported "LLM failures" attributable to annotation errors rather than model limitations.
Method: We propose LLM-as-a-judge, a framework leveraging ensemble judgments from GPT-4, Claude, and Llama, combined with consistency voting and error-sensitivity analysis to automatically detect mislabeled instances; we further apply label smoothing and confident learning for robust label recalibration.
Contribution/Results: Comprehensive evaluation across the TRUE benchmark suite reveals substantial disparities in quality and efficiency among expert, crowdsourced, and LLM-generated annotations. After correction, state-of-the-art models achieve average accuracy gains of 3.2–7.8 percentage points. This work provides the first empirical evidence of systematic label-noise interference in LLM evaluation and introduces a scalable, collaborative adjudication paradigm that reframes data correction as model performance recalibration.
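The core of the detection step described above is consistency voting across an ensemble of LLM judges. The following is a minimal sketch of that idea, not the paper's actual protocol: the function name, input layout, and agreement threshold are illustrative assumptions.

```python
from collections import Counter

def flag_suspect_labels(dataset_labels, judge_verdicts, min_agreement=2):
    """Flag examples where a majority of LLM judges agree on a label
    that differs from the dataset's gold label.

    dataset_labels: gold label per example.
    judge_verdicts: per example, one verdict per LLM judge.
    min_agreement:  minimum number of concurring judges (illustrative).
    Returns (example_index, gold_label, majority_judge_label) tuples.
    """
    flagged = []
    for i, (gold, verdicts) in enumerate(zip(dataset_labels, judge_verdicts)):
        # Majority verdict among the ensemble and its vote count.
        majority_label, votes = Counter(verdicts).most_common(1)[0]
        # Flag only when enough judges concur on a conflicting label.
        if majority_label != gold and votes >= min_agreement:
            flagged.append((i, gold, majority_label))
    return flagged
```

Flagged examples would then go to adjudication rather than being relabeled blindly, which keeps the ensemble as a recall-oriented filter instead of an oracle.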
📝 Abstract
NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate their impact during training to improve model performance.
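One of the training-time mitigations named in the summary is label smoothing, which softens each one-hot target so that training is less sensitive to individual label errors. A minimal sketch, with an illustrative smoothing factor (the paper's exact formulation and hyperparameters are not given here):

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Redistribute `epsilon` probability mass from the annotated class
    uniformly over all classes. A possibly-wrong hard label (p=1.0)
    becomes a soft target, limiting the loss incurred when the
    annotation is an error. `epsilon=0.1` is an illustrative default.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * p + epsilon / num_classes for p in one_hot]
```

For a binary target `[1.0, 0.0]` this yields `[0.95, 0.05]`; the smoothed distribution still sums to 1, so it can be used directly as a cross-entropy target.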