Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multilingual speech emotion recognition (SER) faces a generalization bottleneck in zero-shot cross-lingual settings due to acoustic variability and linguistic diversity. To address this, we propose a two-stage contrastive alignment framework that disentangles emotion semantics from language identity in speech representations. Our method jointly optimizes emotion discriminability and language invariance by integrating semantic priors from large language models (LLMs) with contrastive learning. To support this work, we introduce M5SER—the first large-scale synthetic multilingual SER dataset—comprising over 100,000 utterances across five typologically diverse languages. Extensive experiments demonstrate substantial improvements in zero-shot transfer performance across multiple unseen languages and benchmark datasets, achieving state-of-the-art accuracy. The framework’s effectiveness underscores the value of explicit disentanglement and LLM-informed semantic guidance for cross-lingual emotion representation learning.

📝 Abstract
Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity pose significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
Problem

Research questions and friction points this paper is trying to address.

Address zero-shot emotion recognition across diverse languages
Overcome variability in voice characteristics and linguistic diversity
Align speech signals with linguistic features using contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging contrastive learning for multilingual features
Two-stage training aligns speech with linguistic features
Introducing large-scale synthetic dataset M5SER
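The contrastive-alignment idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it implements an InfoNCE-style loss where, for each utterance, the positives are utterances with the same emotion label but a *different* language, so minimizing the loss pulls representations toward emotion-discriminative, language-invariant clusters. The function name, the positive-pair selection rule, and the temperature value are illustrative assumptions.

```python
import numpy as np

def emotion_contrastive_loss(embeddings, emotions, languages, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch, not the paper's code).

    Positives for each anchor: same emotion label, different language.
    This encourages emotion-aware yet language-agnostic representations.
    """
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(emotions)
    losses = []
    for i in range(n):
        # Positives: same emotion, different language (cross-lingual pairs).
        pos = [j for j in range(n)
               if j != i and emotions[j] == emotions[i] and languages[j] != languages[i]]
        if not pos:
            continue
        others = [j for j in range(n) if j != i]
        # log of the softmax denominator over all non-anchor utterances.
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        for p in pos:
            losses.append(-(sim[i, p] - log_denom))
    return float(np.mean(losses))
```

With embeddings clustered by emotion across languages, the loss is near zero; with embeddings clustered by language instead, it is large, which is the separation the paper's two-stage training targets.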