TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of structured, annotated classroom-video benchmarks for evaluating multimodal models. The authors build TeachObs, a multimodal teaching observation benchmark of 30 public lesson videos from eight countries, segmented into 5,158 fifteen-second scenes and annotated with 39 binary observation codes (20 visual, 19 nonvisual). The benchmark pairs fine-grained scene-level labels with whole-lesson expert ratings and qualitative evaluations, giving two layers of human reference. Gold scene labels are built with reliability- and prevalence-aware rules based on Krippendorff's alpha. Five vision-capable frontier LLMs are evaluated on three tracks: text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol. The results show that no single model consistently outperforms the others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that models over-rate procedurally clear lessons relative to expert raters, which indicates where expert judgment remains necessary in pedagogical assessment.

📝 Abstract

Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal teaching observation in classroom videos. TeachObs includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes. Seven researchers annotated each scene with 39 binary observation codes, covering 20 visual codes, such as gesture, board work, pointing, and visual materials, and 19 nonvisual codes, such as instruction, monitoring, questioning, feedback, and reflection. Gold segment labels are constructed using reliability- and prevalence-aware rules based on Krippendorff's alpha. In addition to segment-level labels, three expert raters produced lesson-level ratings and qualitative evaluations of instructional design, instructional delivery, learner response, learning materials, and lesson closure across the 30 lessons, with rater coverage detailed in the body. Using these two human reference layers, we evaluate five vision-capable frontier LLMs across three tracks - text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol - and find that no single model consistently outperforms others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that model evaluations over-rate procedurally clear lessons relative to expert raters. TeachObs therefore supports both fine-grained annotation benchmarking and whole-lesson evaluation, showing where AI systems can assist classroom video analysis and where expert judgment remains necessary across varied subjects, classroom formats, and annotation difficulty levels.
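
The abstract says gold segment labels rest on Krippendorff's alpha but does not spell out the rules. As a rough illustration only, the sketch below computes the standard nominal-data alpha for one binary observation code across scenes, allowing for missing ratings. The function name `krippendorff_alpha_nominal` and the input layout (one list of rater codes per scene, `None` for a missing rating) are assumptions made for this example, not the authors' implementation, and the paper's reliability and prevalence thresholds are not reproduced here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes (e.g. binary 0/1).

    `units` is a list of scenes; each scene is a list of rater codes,
    with None marking a missing rating (hypothetical layout).
    Returns None when alpha is undefined.
    """
    # Coincidence matrix: every ordered pair of ratings within a scene
    # contributes 1 / (m - 1), where m is the number of ratings in it.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a scene with fewer than two ratings has no pairs
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return None

    # Observed and expected disagreement (off-diagonal mass only).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return None  # only one category was ever used
    return 1.0 - observed / expected


# Two raters, ten scenes, one binary code (illustrative data).
rater_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(rater_a, rater_b)])
```

A code whose alpha falls below some chosen cutoff could then be flagged or excluded before gold labels are derived. Any such cutoff, and any adjustment for rarely occurring codes, would be a design choice that has to come from the paper body rather than from this sketch.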

Problem

Research questions and friction points this paper is trying to address.

classroom videos

multimodal teaching observation

model evaluation

pedagogical signals

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal teaching observation

human-validated benchmark

classroom video analysis