🤖 AI Summary
In speech emotion recognition, the subjectivity and inter-annotator variability inherent in multi-annotator labels are often obscured by simple label averaging, which distorts the modeling of emotional nuance. To address this, we propose an end-to-end multitask framework that jointly predicts individual annotator identities and continuous emotion distributions (e.g., kernel density estimates or parametric distributions) during training. This explicitly models annotator behavioral heterogeneity while preserving population-level variability. Our approach eliminates label averaging and instead integrates annotator modeling directly into the distribution-learning process, enabling co-optimization of annotator-specific characteristics and emotion distributions. Evaluated in both within-corpus and cross-corpus settings, our method produces emotion distributions that are more accurate than those of prior approaches, more faithfully capturing emotion subjectivity and annotator disagreement and offering a principled treatment of annotation uncertainty in affective computing.
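To make the multitask setup concrete, here is a minimal PyTorch sketch of the kind of joint objective described above: a shared encoder feeding one head that identifies which annotator produced a label and another that regresses that annotator's continuous emotion rating. The architecture, layer sizes, loss weighting, and all names (`MultitaskSER`, `annotator_head`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultitaskSER(nn.Module):
    """Hypothetical sketch of a joint annotator/emotion multitask model."""

    def __init__(self, feat_dim=128, hidden_dim=64, n_annotators=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.annotator_head = nn.Linear(hidden_dim, n_annotators)  # who labeled it
        self.emotion_head = nn.Linear(hidden_dim, 1)               # continuous rating

    def forward(self, x):
        h = self.encoder(x)
        return self.annotator_head(h), self.emotion_head(h).squeeze(-1)

# Joint loss: annotator identification + annotator-specific emotion regression.
model = MultitaskSER()
x = torch.randn(8, 128)                    # batch of acoustic features
annotator_ids = torch.randint(0, 6, (8,))  # which annotator gave each label
ratings = torch.rand(8)                    # that annotator's continuous rating

ann_logits, emo_pred = model(x)
loss = nn.functional.cross_entropy(ann_logits, annotator_ids) \
     + nn.functional.mse_loss(emo_pred, ratings)
loss.backward()
```

Because each training example carries its annotator's identity, the model can learn annotator-specific behavior rather than regressing toward an averaged label.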
📝 Abstract
Emotion expression and perception are nuanced, complex, and highly subjective processes. When multiple annotators label emotional data, the resulting labels exhibit high variability. Most speech emotion recognition systems address this by averaging annotator labels into a single ground truth. However, averaging discards the nuance of emotion and the inter-annotator variability, both of which are important signals to capture. Previous work has attempted to learn distributions that capture emotion variability, but these methods still lose information about the individual annotators. We address these limitations by learning to predict individual annotators and by introducing a novel method for constructing distributions from continuous model outputs, which permits emotion distributions to be learned during model training. We show that this combined approach can produce emotion distributions that are more accurate than those of prior work, in both within- and cross-corpus settings.
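Constructing distributions from continuous model outputs in a way that supports training suggests a differentiable density estimate. A minimal sketch, assuming a Gaussian kernel density estimate evaluated on a fixed grid and a KL-divergence matching loss; the bandwidth, grid, and choice of divergence are our assumptions, not details taken from the paper:

```python
import torch

def kde_on_grid(samples, grid, bandwidth=0.1):
    """Differentiable Gaussian KDE evaluated on a fixed grid.

    `samples` are continuous values (model outputs or annotator ratings);
    the result is a normalized discrete density over `grid`.
    """
    # (n_samples, 1) vs (1, n_grid) -> pairwise kernel evaluations
    diff = (samples.unsqueeze(1) - grid.unsqueeze(0)) / bandwidth
    density = torch.exp(-0.5 * diff ** 2).mean(dim=0)
    return density / density.sum()  # normalize to sum to 1

grid = torch.linspace(0.0, 1.0, 101)
labels = torch.tensor([0.3, 0.4, 0.7, 0.8])  # annotator ratings for one utterance
preds = torch.tensor([0.35, 0.5, 0.6, 0.75], requires_grad=True)  # model outputs

p, q = kde_on_grid(labels, grid), kde_on_grid(preds, grid)
# KL(p || q) as one possible distribution-matching objective
loss = torch.sum(p * (torch.log(p + 1e-8) - torch.log(q + 1e-8)))
loss.backward()  # gradients flow back into the continuous predictions
```

Because the KDE is built from differentiable operations, the distribution-matching loss can be optimized end to end, which is what allows emotion distributions to be learned during training rather than fixed in advance.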