🤖 AI Summary
Widespread label noise (10–25%) in NLP benchmark datasets leads to systematic underestimation of model performance, with many purported "LLM failures" attributable to annotation errors rather than model limitations.
Method: We propose LLM-as-a-judge, a framework leveraging ensemble judgments from GPT-4, Claude, and Llama, combined with consistency voting and error-sensitivity analysis to automatically detect mislabeled instances; we further apply label smoothing and confident learning for robust label recalibration.
Contribution/Results: Comprehensive evaluation across the TRUE benchmark suite reveals substantial disparities in quality and efficiency among expert, crowdsourced, and LLM-generated annotations. After correction, state-of-the-art models achieve average accuracy gains of 3.2–7.8 percentage points. This work provides the first empirical evidence of systematic label-noise interference in LLM evaluation and introduces a scalable, collaborative adjudication paradigm that reframes data correction as model performance recalibration.
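The core of the detection step described above is consistency voting across an ensemble of LLM judges. The following is a minimal sketch of that idea, not the paper's actual protocol: the function name, input layout, and agreement threshold are illustrative assumptions.

```python
from collections import Counter

def flag_suspect_labels(dataset_labels, judge_verdicts, min_agreement=2):
    """Flag examples where a majority of LLM judges agree on a label
    that differs from the dataset's gold label.

    dataset_labels: gold label per example.
    judge_verdicts: per example, one verdict per LLM judge.
    min_agreement:  minimum number of concurring judges (illustrative).
    Returns (example_index, gold_label, majority_judge_label) tuples.
    """
    flagged = []
    for i, (gold, verdicts) in enumerate(zip(dataset_labels, judge_verdicts)):
        # Majority verdict among the ensemble and its vote count.
        majority_label, votes = Counter(verdicts).most_common(1)[0]
        # Flag only when enough judges concur on a conflicting label.
        if majority_label != gold and votes >= min_agreement:
            flagged.append((i, gold, majority_label))
    return flagged
```

Flagged examples would then go to adjudication rather than being relabeled blindly, which keeps the ensemble as a recall-oriented filter instead of an oracle.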
📝 Abstract
NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate their impact during training to improve model performance.
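One of the training-time mitigations named in the summary is label smoothing, which softens each one-hot target so that training is less sensitive to individual label errors. A minimal sketch, with an illustrative smoothing factor (the paper's exact formulation and hyperparameters are not given here):

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Redistribute `epsilon` probability mass from the annotated class
    uniformly over all classes. A possibly-wrong hard label (p=1.0)
    becomes a soft target, limiting the loss incurred when the
    annotation is an error. `epsilon=0.1` is an illustrative default.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * p + epsilon / num_classes for p in one_hot]
```

For a binary target `[1.0, 0.0]` this yields `[0.95, 0.05]`; the smoothed distribution still sums to 1, so it can be used directly as a cross-entropy target.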