Investigating the Impact of Word Informativeness on Speech Emotion Recognition

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Speech emotion recognition (SER) often relies on utterance-level statistics (e.g., energy, mean F0) that can mask fine-grained emotional cues. To address this, we propose a semantics-driven feature extraction approach. A pre-trained BERT-style language model quantifies word-level informativeness and thereby identifies semantically salient speech segments. Prosodic features (energy, F0, and their statistics) and self-supervised speech representations are then extracted exclusively from these high-informativeness regions. This work is the first to explicitly incorporate word-level semantic informativeness into SER, addressing the limitations of conventional coarse-grained acoustic modeling. Experiments on benchmark datasets, including RAVDESS and IEMOCAP, show significant improvements in classification accuracy, indicating that semantics-guided local acoustic modeling yields more discriminative representations for emotion recognition.

📝 Abstract

In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.
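To make the segment-selection idea concrete, here is a minimal sketch of ranking words by informativeness. The abstract does not give the exact definition of informativeness, so this sketch swaps in surprisal, -log2 p(word | context), as an assumed proxy. The probabilities are hypothetical stand-ins for what a pre-trained language model might assign, and the names `surprisal_bits`, `select_informative_words`, and `top_fraction` are illustrative, not taken from the paper.

```python
import math


def surprisal_bits(prob):
    """Surprisal in bits, -log2 p(word | context); an assumed informativeness proxy."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def select_informative_words(words, probs, top_fraction=0.5):
    """Return (indices of the most informative words in utterance order, all scores)."""
    if len(words) != len(probs):
        raise ValueError("words and probs must have the same length")
    if not words:
        raise ValueError("at least one word is required")
    scores = [surprisal_bits(p) for p in probs]
    # Keep at least one word so later feature extraction never sees an empty selection.
    k = max(1, math.ceil(top_fraction * len(words)))
    ranked = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical per-word probabilities; a real system would obtain them from a language model.
words = ["i", "am", "absolutely", "furious"]
probs = [0.5, 0.5, 0.125, 0.03125]
selected, scores = select_informative_words(words, probs, top_fraction=0.5)
selected_words = [words[i] for i in selected]
```

In this toy utterance the low-probability words "absolutely" and "furious" receive the highest scores, so they are the ones whose time spans would be passed on to acoustic feature extraction.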

Problem

Research questions and friction points this paper is trying to address.

Identifying speech segments with relevant acoustic variations for emotion recognition

Overcoming limitations of traditional sentence-level feature computation methods

Using word informativeness to improve speech emotion recognition accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses word informativeness for segment selection

Computes acoustic features on selected segments

Combines prosodic and self-supervised features
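The second point above, computing functionals only on selected segments, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the F0 contour is a toy list of per-frame values, the 10 ms frame shift and the word time span are hypothetical, and `frames_in_segments` and `functionals` are names introduced here. Real word boundaries would typically come from a forced aligner, which is not shown.

```python
import statistics


def frames_in_segments(contour, frame_shift_ms, segments_ms):
    """Keep contour values whose frame start time lies in any [start, end) span (ms)."""
    kept = []
    for i, value in enumerate(contour):
        t_ms = i * frame_shift_ms  # integer milliseconds avoid float boundary issues
        if any(start <= t_ms < end for start, end in segments_ms):
            kept.append(value)
    return kept


def functionals(values):
    """Simple functionals (mean, population std, range) of a feature contour."""
    if not values:
        raise ValueError("no frames fall inside the selected segments")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "range": max(values) - min(values),
    }


# Toy F0 contour in Hz, one value per 10 ms frame (hypothetical data).
f0 = [100, 100, 100, 100, 100, 100, 200, 220, 240, 200]
# Hypothetical time span (ms) of one high-informativeness word.
informative_spans = [(60, 100)]

whole = functionals(f0)
local = functionals(frames_in_segments(f0, 10, informative_spans))
```

Here the whole-utterance mean F0 is pulled down by the flat first 60 ms, while the functionals computed on the selected span describe only the raised-pitch region. Self-supervised representations could be pooled over the same frame indices, but that step is omitted from this sketch.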
