🤖 AI Summary
This paper addresses the lack of systematic meta-evaluation for evaluation metrics in the Human Label Variation (HLV) setting. Methodologically, it proposes new soft evaluation metrics and new HLV training methods, and empirically meta-evaluates HLV evaluation metrics against human preferences. Key contributions include: (1) empirical evidence that the proposed soft metric correlates best with human preference among the metrics studied; (2) the finding that training on disaggregated annotations or on soft labels often performs best across HLV metrics; and (3) evidence that soft metrics reflect human judgment more faithfully than conventional hard-label metrics. The study establishes a reproducible evaluation benchmark and provides methodological guidance for HLV modeling, advancing both metric design and annotation-aware evaluation practice in machine learning.
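To make the two training strategies concrete, below is a minimal sketch (NumPy only, with made-up annotations; the array names and data are illustrative, not from the paper) of how per-annotator labels can either be normalised into soft labels or kept as disaggregated training instances.

```python
import numpy as np

# Illustrative annotations (not from the paper): each row is one example,
# each column is one annotator's label over 3 classes.
annotations = np.array([
    [0, 0, 1],   # two annotators chose class 0, one chose class 1
    [2, 2, 2],   # unanimous
    [0, 1, 2],   # full disagreement
])
num_classes = 3

# Soft labels: normalise per-example vote counts into a distribution over classes.
counts = np.array([np.bincount(row, minlength=num_classes) for row in annotations])
soft_labels = counts / counts.sum(axis=1, keepdims=True)
print(soft_labels)   # ≈ [[0.67, 0.33, 0.], [0., 0., 1.], [0.33, 0.33, 0.33]]

# Disaggregated annotations: keep every (example, annotator label) pair as its
# own training instance instead of collapsing to a single aggregated label.
disaggregated = [(i, label) for i, row in enumerate(annotations) for label in row]
print(disaggregated[:4])   # [(0, 0), (0, 0), (0, 1), (1, 2)]
```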
📝 Abstract
Human label variation (HLV) challenges the standard assumption that an example has a single ground truth, instead embracing the natural variation in human labelling to train and evaluate models. While various training methods and metrics for HLV have been proposed, there has been no systematic meta-evaluation of HLV evaluation metrics, contributing to the lack of clarity about which HLV training method is best. We propose new evaluation metrics and training methods and empirically meta-evaluate HLV evaluation metrics. We find that training on either disaggregated annotations or soft labels often performs best across metrics, and that our proposed soft metric correlates best with human preference.
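As an illustration of why a soft metric can separate models that a hard metric cannot, here is a small sketch that compares predicted distributions to the human label distribution. Jensen-Shannon divergence stands in for a generic soft metric and is an assumption for illustration, not necessarily the metric proposed in the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Human label distribution (soft label) for one example, plus two hypothetical models.
human   = np.array([0.60, 0.30, 0.10])
model_a = np.array([0.55, 0.35, 0.10])  # tracks the human distribution closely
model_b = np.array([0.98, 0.01, 0.01])  # same argmax, but overconfident

# Hard metric (accuracy against the majority label): both models look identical.
majority = human.argmax()
print(model_a.argmax() == majority, model_b.argmax() == majority)  # True True

# Soft metric (distance to the full human distribution): the two models separate.
print(js_divergence(model_a, human))  # ≈ 0.002 (close to human variation)
print(js_divergence(model_b, human))  # ≈ 0.19  (ignores human variation)
```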