ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues

📅 2025-10-20
🤖 AI Summary
To address the challenge of automatic student engagement detection in online learning, this paper proposes a dual-stream spatiotemporal network that integrates facial and scene contextual cues. The spatial branch employs EfficientNetV2 to extract static features from both face and scene regions, while the temporal branch leverages LSTM and Transformer architectures to model how behavior evolves over time; an education-video-specific data augmentation strategy is also introduced. The key contributions are threefold: (1) the first explicit incorporation of scene context into engagement recognition, (2) a modular dual-stream design that improves discriminability for low-frequency and subtle engagement states, and (3) improved cross-scenario generalizability. Evaluated on the DAiSEE benchmark, the method achieves 73.43% accuracy, outperforming prior state-of-the-art approaches, and demonstrates the effectiveness and practicality of multimodal spatiotemporal co-modeling in educational affective computing.

📝 Abstract
Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available at https://github.com/prateek-gothwal/ViBED-Net.
Problem

Research questions and friction points this paper is trying to address.

Detecting student engagement in online learning videos
Combining facial expressions with scene context analysis
Improving engagement recognition accuracy using spatiotemporal cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream architecture using face and scene cues
EfficientNetV2 extracts spatial features from video frames
LSTM and Transformer model temporal engagement patterns
Prateek Gothwal
Computer Science & Engineering, University of Colorado Denver, Denver, CO 80234
Deeptimaan Banerjee
Computer Science & Engineering, University of Colorado Denver, Denver, CO 80234
Ashis Kumer Biswas
University of Colorado Denver
Machine Learning · Deep Learning · Bioinformatics