Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

📅 2026-06-04
📈 Citations: 0
Influential: 0
📄 PDF

career value

177K/year
🤖 AI Summary
This work addresses the degraded generalization performance in zero-shot cross-lingual speech emotion recognition, which stems from linguistic distribution shifts and the absence of emotion labels in target languages. To tackle this challenge, we propose a novel approach that integrates supervised contrastive learning with speaker adversarial learning. Specifically, supervised contrastive learning aligns emotion representations across languages, while the speaker adversarial mechanism suppresses speaker-dependent cues, thereby yielding language-invariant yet emotion-discriminative features. To the best of our knowledge, this is the first study to jointly leverage these two mechanisms for zero-shot cross-lingual speech emotion recognition. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly outperforms existing approaches, effectively enhancing cross-lingual generalization capability.
📝 Abstract
Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
Problem

Research questions and friction points this paper is trying to address.

zero-shot
cross-lingual
speech emotion recognition
distribution mismatch
emotion annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot cross-lingual
speech emotion recognition
contrastive learning
adversarial learning
emotion-discriminative representation
🔎 Similar Papers
2024-09-25IEEE International Conference on Acoustics, Speech, and Signal ProcessingCitations: 1