Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers

📅 2025-08-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of precisely quantifying children's behavioral and collaborative engagement in early childhood education, this paper proposes a fine-grained visual analysis method based on the Vision Transformer. We apply the Swin Transformer, in the first use of this architecture for classroom engagement modeling, integrating multimodal visual cues including gaze direction, interactive gestures, and peer collaboration to jointly infer behavioral and collaborative participation. Compared with ViT and DeiT, our Swin-based model achieves 97.58% multi-label temporal classification accuracy on the Child-Play Gaze dataset, demonstrating superior capability in modeling local interactions and capturing global contextual dependencies. Extensive experiments confirm the model's robustness and scalability on real-world educational videos. The approach delivers an interpretable, deployable technical framework for intelligent educational assessment, bridging a critical gap between computer vision and pedagogical analytics.

📝 Abstract
In early childhood education, accurately detecting behavioral and collaborative engagement is essential for fostering meaningful learning experiences. This paper presents an AI-driven approach that leverages Vision Transformers (ViTs) to automatically classify children's engagement using visual cues such as gaze direction, interaction, and peer collaboration. Utilizing the Child-Play gaze dataset, our method is trained on annotated video segments to classify behavioral and collaborative engagement states (e.g., engaged, not engaged, collaborative, not collaborative). We evaluated three state-of-the-art transformer models: Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), and Swin Transformer. Among these, the Swin Transformer achieved the highest classification performance with an accuracy of 97.58%, demonstrating its effectiveness in modeling local and global attention. Our results highlight the potential of transformer-based architectures for scalable, automated engagement analysis in real-world educational settings.
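The engagement states above form a multi-label problem: each video segment receives independent behavioral (engaged / not engaged) and collaborative (collaborative / not collaborative) predictions rather than a single class. A minimal sketch of the label-decoding step under that framing, assuming the classifier emits one logit per label; the label names, logit values, and 0.5 threshold are illustrative, not taken from the paper:

```python
import math

# Hypothetical label order for the two binary engagement dimensions.
LABELS = ["engaged", "collaborative"]

def sigmoid(x: float) -> float:
    """Logistic function mapping a raw logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_engagement(logits, threshold=0.5):
    """Map per-label logits to boolean engagement states.

    Each label is thresholded independently, which is what
    distinguishes multi-label from single-class prediction.
    """
    return {
        name: sigmoid(logit) >= threshold
        for name, logit in zip(LABELS, logits)
    }

# A segment with strong behavioral engagement but no peer collaboration:
print(decode_engagement([2.0, -1.5]))
# {'engaged': True, 'collaborative': False}
```

In a full pipeline these logits would come from a fine-tuned backbone such as the Swin Transformer the paper evaluates; the sigmoid-per-label decoding shown here is the standard multi-label formulation, independent of the backbone choice.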
Problem

Research questions and friction points this paper is trying to address.

Detect children's behavioral and collaborative engagement using Vision Transformers
Classify engagement states from visual cues like gaze and interaction
Evaluate transformer models for automated analysis in educational settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformers classify engagement via gaze and interaction
Swin Transformer achieves 97.58% accuracy in engagement detection
Transformer models enable automated, scalable educational engagement analysis
Sindhuja Penchala
The University of Alabama, Tuscaloosa, AL, USA
Saketh Reddy Kontham
The University of Alabama, Tuscaloosa, AL, USA
Prachi Bhattacharjee
The University of Alabama, Tuscaloosa, AL, USA
Sareh Karami
Mississippi State University, Mississippi State, MS, USA
Mehdi Ghahremani
Mississippi State University, Mississippi State, MS, USA
Noorbakhsh Amiri Golilarz
Assistant Professor at The University of Alabama
AI/Deep Learning, Cognitive Neuroscience, Computer Vision, Image Processing, Hyperspectral Imaging
Shahram Rahimi
Department Head and Professor, The University of Alabama
Computational Intelligence, AI, Agentic AI, Knowledge-Based Systems