Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition

πŸ“… 2025-10-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Real-world speech emotion recognition (SER) suffers from performance degradation due to acoustic noise interference and cross-dataset distribution shifts. To address this, we propose a two-stage robust representation learning framework. In the first stage, Emotion-Disentangled Representation Learning (EDRL) separates emotion-specific features from non-emotional shared features. In the second stage, Multiblock Embedding Alignment (MEA) maps representations from heterogeneous domains into a unified discriminative latent subspace, explicitly enhancing noise robustness and cross-domain generalization. The method maximizes the covariance between the input speech and the disentangled representations, thereby reducing reliance on clean, labeled data. Extensive experiments on multiple noisy and cross-corpus benchmarks demonstrate significant improvements over state-of-the-art baselines, validating the framework's effectiveness and robustness under complex real-world conditions.

πŸ“ Abstract
The effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, the model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are then used to train an emotion classifier on clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech emotion recognition robustness in noisy environments
Improving cross-corpus generalization through disentangled representation learning
Aligning embeddings to maximize covariance with original speech input
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion-Disentangled Representation Learning extracts discriminative features
Multiblock Embedding Alignment projects features into latent subspace
Joint discriminative subspace maximizes covariance with speech input
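The paper does not provide implementation details here, but the covariance-maximizing alignment idea resembles a PLS-style projection: for each block of embeddings, find the directions whose projections have maximal cross-covariance with the original speech features, via an SVD of the cross-covariance matrix. The sketch below is a minimal illustration of that generic technique, not the authors' actual MEA algorithm; the function name `mea_align_sketch` and all parameters are hypothetical.

```python
import numpy as np

def mea_align_sketch(blocks, speech_feats, n_components=8):
    """Illustrative covariance-maximizing alignment (PLS-style).

    blocks: list of (n_samples, d_k) embedding matrices (one per block/domain).
    speech_feats: (n_samples, d_x) original speech feature matrix.
    Returns each block projected onto its top covariance-maximizing directions.
    """
    Xc = speech_feats - speech_feats.mean(axis=0)  # center speech features
    aligned = []
    for Z in blocks:
        Zc = Z - Z.mean(axis=0)                    # center block embeddings
        C = Zc.T @ Xc / (len(Z) - 1)               # cross-covariance matrix
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        W = U[:, :n_components]                    # directions maximizing covariance
        aligned.append(Zc @ W)                     # project block into subspace
    return aligned
```

The left singular vectors of the cross-covariance matrix give, for each block, the orthonormal directions along which the projected embeddings co-vary most strongly with the speech input, which is the standard PLS-SVD construction.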