🤖 AI Summary
This work addresses the degraded generalization performance in zero-shot cross-lingual speech emotion recognition, which stems from linguistic distribution shifts and the absence of emotion labels in target languages. To tackle this challenge, we propose a novel approach that integrates supervised contrastive learning with speaker adversarial learning. Specifically, supervised contrastive learning aligns emotion representations across languages, while the speaker adversarial mechanism suppresses speaker-dependent cues, thereby yielding language-invariant yet emotion-discriminative features. To the best of our knowledge, this is the first study to jointly leverage these two mechanisms for zero-shot cross-lingual speech emotion recognition. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly outperforms existing approaches, effectively enhancing cross-lingual generalization capability.
📝 Abstract
Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently generalize poorly when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. Supervised contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
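To make the contrastive component concrete, the following is a minimal sketch of a standard supervised contrastive loss written in plain Python. It is not the authors' implementation: the function names, the temperature value, and the toy embeddings are all assumptions chosen for illustration, and the abstract does not state which loss formulation, encoder, or hyperparameters are used. The speaker adversarial branch is not shown. The idea illustrated is that utterances sharing an emotion label are pulled together in embedding space regardless of which language or speaker produced them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    Illustrative only; `temperature=0.1` is an assumed value.
    embeddings: list of equal-length float lists, one per utterance.
    labels:     emotion label for each utterance.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-emotion partner is skipped
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp for the denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives.
        total -= sum(sims[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Toy batch: two tight clusters in a 2-D embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered_by_emotion = ["angry", "angry", "sad", "sad"]
emotions_split_across_clusters = ["angry", "sad", "angry", "sad"]

loss_aligned = supervised_contrastive_loss(toy_embeddings, clustered_by_emotion)
loss_misaligned = supervised_contrastive_loss(
    toy_embeddings, emotions_split_across_clusters
)
print(loss_aligned < loss_misaligned)
```

In this toy batch the loss is lower when the emotion labels coincide with the embedding clusters than when each emotion is split across clusters, which is the pressure that would encourage same-emotion utterances from different languages to share a region of the representation space.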