🤖 AI Summary
Learner emotion recognition in online education lacks robustness: existing methods rely on static multimodal fusion and assume uniformly reliable modal inputs, even though modalities are frequently missing or noisy in practice. This paper proposes a dynamic, adaptive emotion recognition framework with three key contributions: (1) a cross-modal attention alignment mechanism that enables fine-grained semantic matching across heterogeneous modalities; (2) a confidence-based modality importance estimator that dynamically assesses and weights modality reliability in real time; and (3) a temporal feedback recurrent architecture that models the consistency of emotional evolution. Evaluated on the re-annotated subsets IEMOCAP-EDU and MOSEI-EDU, the method significantly outperforms state-of-the-art approaches on four critical learning emotions (confusion, curiosity, boredom, and frustration), with absolute accuracy gains of 3.2–5.7%. It also remains robust under missing or noisy modalities, enabling real-time, adaptive affective feedback.
📝 Abstract
Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems.
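To make the MIE's role concrete, here is a minimal sketch of confidence-based modality fusion. The paper does not publish its implementation, so the function names (`fuse_modalities`), feature dimensions, and confidence values below are illustrative assumptions, not the authors' code: per-modality reliability scores are softmax-normalized into weights, so an unreliable modality (e.g. noisy audio) contributes little to the fused representation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_modalities(features, confidences):
    """Fuse per-modality feature vectors by confidence weighting.

    features:    dict of modality name -> 1-D feature vector (shared dim)
    confidences: dict of modality name -> scalar reliability estimate
                 (in the paper these would come from the learned MIE;
                 here they are hand-set for illustration)
    Returns the weighted sum of feature vectors and the weight per modality.
    """
    names = sorted(features)
    feats = np.stack([features[m] for m in names])    # (M, D)
    conf = np.array([confidences[m] for m in names])  # (M,)
    weights = softmax(conf)                           # (M,) sums to 1
    return weights @ feats, dict(zip(names, weights))

# Hypothetical time step: audio is noisy, so its confidence is low
# and the fused vector leans on text and video.
feats = {"audio": np.full(4, 0.1),
         "text":  np.full(4, 0.9),
         "video": np.full(4, 0.5)}
conf = {"audio": -2.0, "text": 1.5, "video": 0.5}
fused, weights = fuse_modalities(feats, conf)
```

In the full model this weighting is recomputed at every time step, which is what lets the network down-weight a modality the moment it drops out or degrades, rather than relying on a fixed fusion scheme.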