Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

📅 2025-06-12
🤖 AI Summary
To address the dual robustness challenges of missing modalities and out-of-distribution (OOD) generalization in multimodal emotion recognition (MER), this paper proposes CIDer (Causal Inference Distiller). First, it introduces a new task, Random Modality Feature Missing (RMFM), which generalizes the definition of modality missing. Second, it introduces a Model-Agnostic Causal Inference (MACI) module that uses a tailored causal graph and fine-grained counterfactual texts to mitigate label and language biases. Third, a Model-Specific Self-Distillation (MSSD) module applies weight-sharing self-distillation across low-level features, attention maps, and high-level representations, while a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity. Evaluated on newly repartitioned MER OOD datasets and under the RMFM task, CIDer performs robustly with fewer parameters and faster training than state-of-the-art methods. MACI can also be used on its own to improve OOD generalization with minimal additional parameters.
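The summary does not say how the causal module removes language bias. A common pattern in counterfactual debiasing is to subtract the prediction of a text-only (counterfactual) branch from the full multimodal prediction. The sketch below illustrates that general idea only; the function names, the `alpha` scaling factor, and the subtraction rule itself are assumptions, not details taken from CIDer.

```python
def debias_logits(factual, counterfactual, alpha=1.0):
    """Subtract a scaled counterfactual (text-only) prediction from the factual one.

    This is a generic counterfactual-debiasing rule, assumed here for
    illustration; the paper's actual formulation may differ.
    """
    if len(factual) != len(counterfactual):
        raise ValueError("logit vectors must have the same length")
    return [f - alpha * c for f, c in zip(factual, counterfactual)]


def predict(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for three emotion classes.
full_model = [2.0, 1.5, 0.0]   # all modalities present
text_only = [1.5, 0.0, 0.0]    # what the language shortcut alone would predict
debiased = debias_logits(full_model, text_only)
```

In this toy case the full model prefers class 0 mostly because the text-only branch does, and the debiased logits favor class 1 instead.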

📝 Abstract
Recent Multimodal Emotion Recognition (MER) methods struggle to handle modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. We also introduce new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training than state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
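The abstract does not define RMFM in detail. One plausible reading is that individual feature vectors within each modality go missing at random, rather than whole modalities disappearing. The sketch below simulates that reading; the function name `random_feature_missing`, the `missing_rate` parameter, and the choice to zero-fill missing steps are all assumptions made for illustration.

```python
import random


def random_feature_missing(features, missing_rate, rng=None):
    """Zero out each time step of each modality independently.

    `features` maps a modality name to a list of per-time-step feature
    vectors. Each step is dropped with probability `missing_rate`.
    Zero-filling is an assumption of this sketch, not a detail from the paper.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    rng = rng or random.Random()
    masked = {}
    for modality, steps in features.items():
        masked[modality] = [
            [0.0] * len(vec) if rng.random() < missing_rate else list(vec)
            for vec in steps
        ]
    return masked


# Hypothetical toy input: two text steps and three audio steps.
sample = {"text": [[1.0, 2.0], [3.0, 4.0]], "audio": [[5.0], [6.0], [7.0]]}
corrupted = random_feature_missing(sample, missing_rate=0.5, rng=random.Random(0))
```

Under this reading, setting `missing_rate` to 1.0 for a single modality would recover the classic whole-modality-missing setting as a special case.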
Problem

Research questions and friction points this paper is trying to address.

Addresses missing modalities and OOD data in MER
Reduces excessive parameters and computational complexity
Enhances robustness and generalization in MER
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Inference Distiller for robust MER
Self-distillation and causal inference modules
Efficient multimodal fusion with transformers
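As a rough illustration of self-distillation at several levels, the sketch below compares what a single shared-weight network produces for complete inputs against what it produces for inputs with missing features, at three levels (low-level features, attention maps, high-level representations). The use of mean squared error, the equal default weights, and the dictionary keys are assumptions; the abstract does not specify the loss terms.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


LEVELS = ("low", "attn", "high")  # hypothetical level names


def self_distillation_loss(complete_out, missing_out, weights=None):
    """Weighted sum of per-level distances between two passes of one network.

    `complete_out` comes from the pass on complete inputs (acting as teacher),
    `missing_out` from the pass on corrupted inputs (acting as student).
    """
    weights = weights or {level: 1.0 for level in LEVELS}
    return sum(
        weights[level] * mse(complete_out[level], missing_out[level])
        for level in LEVELS
    )


# Hypothetical flattened outputs at each level.
teacher = {"low": [1.0, 1.0], "attn": [0.5, 0.5], "high": [2.0]}
student = {"low": [0.0, 0.0], "attn": [0.5, 0.5], "high": [2.0]}
loss = self_distillation_loss(teacher, student)
```

Because both passes share the same weights, a loss of this kind adds no parameters, which fits the stated goal of avoiding excessive model size.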
Guowei Zhong
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China
Ruohong Huan
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China
Mingzhen Wu
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China
Ronghua Liang
Zhejiang University of Technology
Medical Visualization, Image Processing, Big Data-Visualization
Peng Chen
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China