🤖 AI Summary
This study addresses the underexamined dual biases of age (adolescents vs. adults aged 55+) and language (Danish vs. English) in speech emotion recognition (SER). We propose a goal-directed behavioral framework that employs a visual emotion-guidance interface to elicit controlled emotional speech, integrates valence-arousal space mapping, and logs real-time human–machine intent discrepancies to quantify emotional misalignment risk. A custom system enables cross-lingual, cross-age real-time prediction and logging. Experiments show model robustness across the age and language dimensions, with no statistically significant performance differences, yet reveal systematic limitations in recognizing high-arousal emotions. Our key contribution is the formal incorporation of intent alignment into the SER evaluation paradigm, shifting focus from accuracy-centric metrics toward inclusive modeling grounded in user experience and affective semantic alignment.
📝 Abstract
This study explores how age and language shape the deliberate vocal expression of emotion, addressing two underexplored user groups in speech emotion recognition (SER): teenagers (N = 12) and adults aged 55+ (N = 12). While most SER systems are trained on spontaneous, monolingual English data, our research evaluates how such models interpret intentionally performed emotional speech across age groups and languages (Danish and English). To support this, we developed a novel experimental paradigm combining a custom user interface with a backend for real-time SER prediction and data logging. Participants were prompted to hit visual targets in valence-arousal space by deliberately expressing four target emotions. Limitations include some reliance on self-managed voice recordings and inconsistent task execution. Even so, the results suggest, contrary to expectations, no significant differences between language or age groups and a degree of cross-linguistic and cross-age robustness in model interpretation, although some limitations in high-arousal emotion recognition were evident. Our qualitative findings highlight the need to move beyond system-centered accuracy metrics and embrace more inclusive, human-centered SER models. By framing emotional expression as a goal-directed act and logging the real-time gap between human intent and machine interpretation, we expose the risks of affective misalignment.
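The gap between human intent and machine interpretation described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the distance measure, the coordinate scale, or the hit criterion, so the Euclidean distance, the [-1, 1] axis range, the `hit_radius` threshold, and all function and field names below are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # (valence, arousal), assumed to lie in [-1, 1]


def misalignment(target: Point, predicted: Point) -> float:
    """Euclidean distance between the intended target and the model's prediction.

    Assumption: the abstract does not say how the gap is measured;
    Euclidean distance is one plausible choice.
    """
    return math.dist(target, predicted)


def log_trial(target: Point, predicted: Point, hit_radius: float = 0.3) -> Dict[str, object]:
    """Build one log record for a single attempt.

    `hit_radius` is a hypothetical threshold for counting a prediction
    as having reached the visual target.
    """
    gap = misalignment(target, predicted)
    return {
        "target": target,
        "predicted": predicted,
        "gap": gap,
        "hit": gap <= hit_radius,
    }


# Illustrative values only: a high-arousal, positive-valence target
# and a prediction that underestimates arousal.
record = log_trial(target=(0.6, 0.8), predicted=(0.6, 0.2))
print(record["gap"], record["hit"])
```

In this sketch, a prediction that matches the intended valence but falls short on arousal produces a large gap and a miss, which is the kind of high-arousal shortfall the abstract reports.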