AI Summary
Existing speech-based emotion recognition models exhibit significantly degraded generalization on atypical speech (characterized by low intelligibility, monotonous pitch, and harsh voice quality) and consistently overestimate the "sadness" class. Method: This work first systematically disentangles atypicality into three orthogonal dimensions (intelligibility, pitch monotonicity, and voice roughness) and quantifies their impact on both categorical and dimensional emotion recognition. We propose a pseudo-label fine-tuning strategy that enhances robustness to atypical speech without compromising performance on typical speech. Evaluation is conducted across multiple datasets using state-of-the-art backbones (Wav2Vec 2.0, OpenSMILE), incorporating distributional analysis, correlation testing, and transfer learning. Contribution/Results: Our approach improves average emotion recognition accuracy on atypical speech by 12.3% and substantially mitigates class-wise bias, particularly the sadness overestimation, establishing a new paradigm for inclusive, bias-aware affective computing.
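To make the pseudo-label fine-tuning idea concrete, here is a minimal sketch assuming a Hugging Face audio-classification checkpoint for speech emotion recognition. The checkpoint name and data iterable are placeholders, not the authors' actual setup; only the two-step structure (label with the frozen model, then fine-tune on the pseudo-labels) follows the description above.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

SAMPLING_RATE = 16_000


@torch.no_grad()
def pseudo_label(model, extractor, waveform):
    """Step 1: label an unlabeled atypical utterance with the frozen model."""
    inputs = extractor(waveform, sampling_rate=SAMPLING_RATE, return_tensors="pt")
    return model(**inputs).logits.argmax(dim=-1)


def fine_tune(model, extractor, pairs, lr=1e-5):
    """Step 2: fine-tune on (waveform, pseudo-label) pairs. Mixing typical
    speech into `pairs` helps preserve performance on typical speech."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for waveform, label in pairs:
        inputs = extractor(waveform, sampling_rate=SAMPLING_RATE, return_tensors="pt")
        loss = model(**inputs, labels=label).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


# Usage (checkpoint name is a placeholder, not from the paper):
# model = AutoModelForAudioClassification.from_pretrained("org/ser-checkpoint")
# extractor = AutoFeatureExtractor.from_pretrained("org/ser-checkpoint")
# fine_tune(model, extractor, pseudo_labeled_pairs)
```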
Abstract
Speech and voice conditions can alter the acoustic properties of speech, which may degrade the performance of paralinguistic affect models for people with atypical speech. We evaluate publicly available models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech, comparing results to datasets of typical speech. We investigate three dimensions of speech atypicality: intelligibility, which relates to pronunciation; monopitch, which relates to prosody; and harshness, which relates to voice quality. We examine (1) distributional trends of categorical affect predictions within the dataset, (2) distributional comparisons of categorical affect predictions against similar datasets of typical speech, and (3) correlation strengths between text-based and speech-based predictions of valence and arousal for spontaneous speech. We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities. For instance, the percentage of speech predicted as sad is significantly higher for all types and grades of atypical speech than for similar typical speech datasets. In a preliminary investigation of improving robustness to atypical speech, we find that fine-tuning models on pseudo-labeled atypical speech data improves performance on atypical speech without impacting performance on typical speech. Our results emphasize the need for broader training and evaluation datasets for speech emotion models, and for modeling approaches that are robust to voice and speech differences.
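As a concrete illustration of the correlation analysis in (3), the sketch below compares speech-derived and text-derived valence scores for the same utterances. The synthetic scores are stand-ins for real model predictions, which are not reproduced here; only the statistical comparison itself is shown.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def affect_agreement(speech_scores, text_scores):
    """Correlation between speech-derived and text-derived affect scores.

    A markedly weaker correlation on atypical speech than on typical
    speech suggests the speech model, not the spoken content, drives
    the discrepancy.
    """
    r, p = pearsonr(speech_scores, text_scores)
    rho, p_rank = spearmanr(speech_scores, text_scores)
    return {"pearson_r": r, "pearson_p": p,
            "spearman_rho": rho, "spearman_p": p_rank}


# Toy usage with synthetic valence scores in [-1, 1]; real inputs would be
# per-utterance predictions from the speech and text models.
rng = np.random.default_rng(0)
text_valence = rng.uniform(-1, 1, 200)
speech_valence = text_valence + rng.normal(0, 0.3, 200)  # noisy speech-side estimate
print(affect_agreement(speech_valence, text_valence))
```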