Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degraded generalization of zero-shot cross-lingual speech emotion recognition, which stems from distribution shifts between languages and the absence of emotion labels in the target language. To tackle this challenge, the authors propose an approach that integrates supervised contrastive learning with speaker adversarial learning. Supervised contrastive learning aligns emotion representations across languages, while the speaker adversarial mechanism suppresses speaker-dependent cues, yielding features that are speaker-invariant, cross-lingually aligned, and emotion-discriminative. To the best of the authors' knowledge, this is the first study to jointly leverage these two mechanisms for zero-shot cross-lingual speech emotion recognition. Experiments in a zero-shot cross-lingual setting show that the method significantly improves recognition performance over conventional training strategies.
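
One common way to write down a joint objective of this kind is sketched below. It is an illustrative formulation, not the paper's stated loss: the encoder, emotion-classifier, and speaker-classifier parameters (\theta_f, \theta_e, \theta_s), the weights \alpha and \lambda, and the decomposition into three terms are all assumptions introduced here.

```latex
% Assumed notation (not from the source):
%   L_emo : cross-entropy emotion classification loss on source-language data
%   L_sup : supervised contrastive loss over emotion labels
%   L_spk : cross-entropy speaker classification loss
%   alpha, lambda >= 0 : hypothetical trade-off weights
\begin{aligned}
\min_{\theta_f,\,\theta_e} \quad & \mathcal{L}_{\mathrm{emo}}(\theta_f, \theta_e)
    + \alpha\, \mathcal{L}_{\mathrm{sup}}(\theta_f)
    - \lambda\, \mathcal{L}_{\mathrm{spk}}(\theta_f, \theta_s) \\
\min_{\theta_s} \quad & \mathcal{L}_{\mathrm{spk}}(\theta_f, \theta_s)
\end{aligned}
```

Under this reading, the speaker classifier tries to identify the speaker while the encoder is rewarded for making that task harder, which is typically implemented with a gradient reversal layer between the two.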

📝 Abstract

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. Supervised contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
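
To make the contrastive component concrete, here is a minimal pure-Python sketch of a supervised contrastive loss in its widely used form (each anchor is pulled toward all other samples with the same emotion label and pushed away from the rest). It is an illustration only: the abstract does not specify the loss variant, the temperature, or the batching scheme, so the function name `supcon_loss`, the default `temperature=0.1`, and the toy 2-D embeddings are all assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (hypothetical helper).

    embeddings: list of equal-length, non-zero float vectors.
    labels: emotion label per embedding; same label means positive pair.
    temperature: assumed value, not taken from the paper.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled similarity of the anchor to every other sample in the batch.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        # Average negative log-probability over the anchor's positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(toy, ["happy", "happy", "sad", "sad"])
mixed = supcon_loss(toy, ["happy", "sad", "happy", "sad"])
print(aligned < mixed)
```

The loss is low when same-emotion embeddings already sit close together and high when they do not. In a cross-lingual setup one would presumably draw each batch from several source languages, so that same-emotion utterances from different languages act as positives; that batching choice is also an assumption here.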

Problem

Research questions and friction points this paper is trying to address.

zero-shot

cross-lingual

speech emotion recognition

distribution mismatch

emotion annotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot cross-lingual

speech emotion recognition

contrastive learning