AI Summary
Real-world speech emotion recognition (SER) suffers from performance degradation caused by acoustic noise and cross-dataset distribution shifts. To address this, we propose a two-stage robust representation learning framework. In the first stage, Emotion-Disentangled Representation Learning (EDRL) separates emotion-specific discriminative features from similarities shared across emotion categories. In the second stage, Multiblock Embedding Alignment (MEA) projects the representations from heterogeneous domains into a joint discriminative latent subspace that maximizes covariance with the original speech input, explicitly enhancing noise robustness and cross-domain generalization. The resulting EDRL-MEA embeddings are used to train an emotion classifier on clean samples from publicly available datasets and are evaluated on unseen noisy and cross-corpus speech. Extensive experiments on multiple noisy and cross-corpus benchmarks demonstrate significant improvements over state-of-the-art baselines, validating the framework's effectiveness and robustness under complex real-world conditions.
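As a rough, non-authoritative sketch of the first stage, the snippet below shows one common way to realize emotion-disentangled representation learning: a two-branch encoder whose emotion branch is trained with a classification loss while a soft orthogonality penalty keeps the two branches decorrelated. All layer sizes, the loss weighting, and the orthogonality term are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class EDRLEncoder(nn.Module):
    """Illustrative two-branch encoder: splits an utterance-level feature
    vector into an emotion-specific part and a shared (emotion-agnostic)
    part. Branch sizes and layers are assumptions for illustration only."""
    def __init__(self, in_dim=768, emo_dim=128, shared_dim=128, n_classes=4):
        super().__init__()
        self.emo_branch = nn.Sequential(
            nn.Linear(in_dim, emo_dim), nn.ReLU(), nn.Linear(emo_dim, emo_dim))
        self.shared_branch = nn.Sequential(
            nn.Linear(in_dim, shared_dim), nn.ReLU(), nn.Linear(shared_dim, shared_dim))
        self.classifier = nn.Linear(emo_dim, n_classes)

    def forward(self, x):
        z_emo = self.emo_branch(x)        # emotion-specific embedding
        z_shared = self.shared_branch(x)  # shared, emotion-agnostic embedding
        return z_emo, z_shared

def edrl_loss(model, x, y):
    """Cross-entropy on the emotion branch plus a soft orthogonality penalty
    that discourages overlap between the branches (one standard way to
    encourage disentanglement; the paper's exact losses may differ)."""
    z_emo, z_shared = model(x)
    ce = nn.functional.cross_entropy(model.classifier(z_emo), y)
    z_e = z_emo - z_emo.mean(0)          # center batch-wise
    z_s = z_shared - z_shared.mean(0)
    ortho = (z_e.T @ z_s).pow(2).mean()  # cross-correlation penalty
    return ce + 0.1 * ortho              # 0.1 is an assumed weight
```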
Abstract
The effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach that enhances the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving similarities shared across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are then used to train an emotion classifier on clean samples from publicly available datasets and are evaluated on unseen noisy and cross-corpus speech samples. The improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
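For the second stage, a minimal sketch of a covariance-maximizing projection follows, in the spirit of (multiblock) partial least squares: the top singular directions of the cross-covariance between the input speech features and the disentangled embeddings give projections whose latent scores have maximal covariance. The dimensions, the choice of k, and the SVD construction are assumptions for illustration; the paper's MEA may differ in detail.

```python
import numpy as np

def mea_projection(X, Z, k=64):
    """Covariance-maximizing alignment between input speech features
    X (n x d) and disentangled embeddings Z (n x m) via SVD of their
    cross-covariance; k must not exceed min(d, m). Illustrative sketch,
    not necessarily the paper's exact MEA algorithm."""
    Xc = X - X.mean(axis=0)                  # center both blocks
    Zc = Z - Z.mean(axis=0)
    C = Xc.T @ Zc / (len(X) - 1)             # d x m cross-covariance
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Wx, Wz = U[:, :k], Vt[:k].T              # top-k covariance-maximizing directions
    return Xc @ Wx, Zc @ Wz                  # aligned latent coordinates
```

Under this reading, training the downstream emotion classifier on the aligned coordinates (the first returned block) would mirror the paper's use of EDRL-MEA embeddings learned from clean samples.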