🤖 AI Summary
To address the challenge of monitoring student engagement in online learning, this paper proposes EngageFormer, a transformer-based architecture with sequence pooling that classifies engagement from video. The model computes three views of the input video, processes them in parallel with transformer encoders, fuses the resulting representations in a global encoder, and predicts the engagement level with a multilayer perceptron (MLP) classifier. Evaluated on five datasets (DAiSEE, BAUM-1, YawDD, UTA-RLDD, and a learning-centered affective state dataset curated from existing open-source databases), EngageFormer reaches a peak accuracy of 99.16% on YawDD and achieves state-of-the-art results on three of them: DAiSEE, BAUM-1, and YawDD. Its result on the two-class UTA-RLDD task is offered as a baseline for future work.
📝 Abstract
The COVID-19 pandemic and the widespread availability of the internet have recently boosted online learning. However, monitoring engagement in online learning is a difficult task for teachers. In this context, timely automatic classification of student engagement can help teachers make adaptive adjustments to meet students' needs. This paper proposes EngageFormer, a transformer-based architecture with sequence pooling that uses the video modality for engagement classification. The proposed architecture computes three views from the input video and processes them in parallel using transformer encoders; a global encoder then processes the representation from each encoder, and finally a multilayer perceptron (MLP) predicts the engagement level. A learning-centered affective state dataset is curated from existing open-source databases. The proposed method achieved accuracies of 63.9%, 56.73%, 99.16%, 65.67%, and 74.89% on the Dataset for Affective States in E-Environments (DAiSEE), the Bahcesehir University Multimodal Affective Database-1 (BAUM-1), the Yawning Detection Dataset (YawDD), the University of Texas at Arlington Real-Life Drowsiness Dataset (UTA-RLDD), and the curated learning-centered affective state dataset, respectively. The results on the BAUM-1, DAiSEE, and YawDD datasets are state of the art, indicating that the proposed model classifies affective states on these datasets more accurately than prior methods. Additionally, the results on the UTA-RLDD dataset, which involves two-class classification, serve as a baseline against which future work can compare and improve.
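The data flow described in the abstract (three views, one pooled representation per view, then a global stage) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how sequence pooling is computed, so the sketch assumes the softmax-weighted token average commonly used in compact transformer models, and it omits the transformer encoders and the MLP entirely. The names `sequence_pool`, `pool_views`, and `score_weights` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens, score_weights):
    """Collapse a token sequence (T x D) into a single D-dimensional vector.

    Assumption: each token gets a scalar score dot(token, score_weights),
    the scores are normalised with softmax, and the output is the
    weighted average of the tokens.
    """
    scores = [sum(t * w for t, w in zip(token, score_weights)) for token in tokens]
    attn = softmax(scores)
    dim = len(tokens[0])
    return [sum(a * token[d] for a, token in zip(attn, tokens)) for d in range(dim)]


def pool_views(views, score_weights):
    """Pool each view separately, yielding one vector per view.

    The resulting short sequence is what a global encoder would consume.
    """
    return [sequence_pool(tokens, score_weights) for tokens in views]


# Toy stand-ins for per-view encoder outputs: 3 views, 4 tokens each, dimension 2.
views = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
    [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0], [2.0, 2.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]],
]

# Zero score weights give every token the same score, so pooling reduces to a plain mean.
pooled = pool_views(views, score_weights=[0.0, 0.0])
```

With learned, non-zero `score_weights`, tokens that score higher contribute more to the pooled vector, which is the usual motivation for sequence pooling over a fixed class token or a plain mean.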