🤖 AI Summary
This paper addresses the lack of systematic meta-evaluation for evaluation metrics in the Human Label Variation (HLV) setting. Methodologically, it proposes new soft evaluation metrics and new HLV training methods, and empirically meta-evaluates HLV evaluation metrics against human preferences. Key contributions include: (1) empirical evidence that the proposed soft metric correlates best with human preference among the metrics studied; (2) the finding that training on disaggregated annotations or on soft labels often performs best across HLV metrics; and (3) evidence that soft metrics reflect human judgment more faithfully than conventional hard-label metrics. The study establishes a reproducible evaluation benchmark and provides methodological guidance for HLV modeling, advancing both metric design and annotation-aware evaluation practice in machine learning.
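To make the two training strategies concrete, below is a minimal sketch (NumPy only, with made-up annotations; the array names and data are illustrative, not from the paper) of how per-annotator labels can either be normalised into soft labels or kept as disaggregated training instances.

```python
import numpy as np

# Illustrative annotations (not from the paper): each row is one example,
# each column is one annotator's label over 3 classes.
annotations = np.array([
    [0, 0, 1],   # two annotators chose class 0, one chose class 1
    [2, 2, 2],   # unanimous
    [0, 1, 2],   # full disagreement
])
num_classes = 3

# Soft labels: normalise per-example vote counts into a distribution over classes.
counts = np.array([np.bincount(row, minlength=num_classes) for row in annotations])
soft_labels = counts / counts.sum(axis=1, keepdims=True)
print(soft_labels)   # ≈ [[0.67, 0.33, 0.], [0., 0., 1.], [0.33, 0.33, 0.33]]

# Disaggregated annotations: keep every (example, annotator label) pair as its
# own training instance instead of collapsing to a single aggregated label.
disaggregated = [(i, label) for i, row in enumerate(annotations) for label in row]
print(disaggregated[:4])   # [(0, 0), (0, 0), (0, 1), (1, 2)]
```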
📝 Abstract
Human label variation (HLV) challenges the standard assumption that an example has a single ground truth, instead embracing the natural variation in human labelling to train and evaluate models. While various training methods and metrics for HLV have been proposed, there has been no systematic meta-evaluation of HLV evaluation metrics, contributing to the lack of clarity about which HLV training method is best. We propose new evaluation metrics and training methods and empirically meta-evaluate HLV evaluation metrics. We find that training on either disaggregated annotations or soft labels often performs best across metrics, and that our proposed soft metric correlates best with human preference.
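As an illustration of why a soft metric can separate models that a hard metric cannot, here is a small sketch that compares predicted distributions to the human label distribution. Jensen-Shannon divergence stands in for a generic soft metric and is an assumption for illustration, not necessarily the metric proposed in the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Human label distribution (soft label) for one example, plus two hypothetical models.
human   = np.array([0.60, 0.30, 0.10])
model_a = np.array([0.55, 0.35, 0.10])  # tracks the human distribution closely
model_b = np.array([0.98, 0.01, 0.01])  # same argmax, but overconfident

# Hard metric (accuracy against the majority label): both models look identical.
majority = human.argmax()
print(model_a.argmax() == majority, model_b.argmax() == majority)  # True True

# Soft metric (distance to the full human distribution): the two models separate.
print(js_divergence(model_a, human))  # ≈ 0.002 (close to human variation)
print(js_divergence(model_b, human))  # ≈ 0.19  (ignores human variation)
```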