🤖 AI Summary
To address the challenge of automatic student engagement detection in online learning, this paper proposes a dual-stream spatiotemporal network integrating facial and scene contextual cues. The spatial branch employs EfficientNetV2 to extract static features from both face and scene regions, while the temporal branch jointly leverages LSTM and Transformer architectures to model dynamic behavioral evolution over time; additionally, an education-video-specific data augmentation strategy is introduced. Our key contributions are threefold: (1) the first explicit incorporation of scene context into engagement recognition, (2) a modular dual-stream design that enhances discriminability for low-frequency and subtle engagement states, and (3) improved cross-scenario generalizability. Evaluated on the DAiSEE benchmark, our method achieves 73.43% accuracy—outperforming prior state-of-the-art approaches—and demonstrates the effectiveness and practicality of multimodal spatiotemporal co-modeling in educational affective computing.
📝 Abstract
Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on https://github.com/prateek-gothwal/ViBED-Net .