🤖 AI Summary
Existing educational question generation methods rely heavily on manually edited text, making them ill-suited for real-world classroom videos, which comprise speech transcripts and slide keyframes; they also suffer from contextual selection bias and poor alignment between generated questions, temporal anchors, and answers. This paper introduces the first multimodal question generation framework tailored to classroom videos, featuring two core innovations: dynamic context selection and answer-anchored rewriting. The method integrates large language model (LLM)-driven bimodal retrieval over speech and vision, temporal-aware context filtering, and answer-guided context reconstruction to enable cross-modal knowledge fusion and conditional question generation. Evaluated on a real-world classroom video dataset, the approach significantly improves question relevance, answer faithfulness, and temporal localization accuracy. The code and dataset are publicly released.
📝 Abstract
Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts and fail to represent real-world classroom content, where lecture speech is accompanied by a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current EQG methods struggle to accurately generate questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring that generated questions meaningfully incorporate the target answer. To address these challenges, we introduce a novel framework that uses large language models to dynamically select and rewrite contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements, strengthening the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released at https://github.com/mengxiayu/COSER.
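The context-selection step described above (scoring transcript segments by answer relevance and temporal proximity, then keeping the best ones) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the scoring function, weights, and segment format are all hypothetical, and the paper's LLM-based retrieval and rewriting are replaced here by simple token overlap.

```python
# Hypothetical sketch of dynamic context selection for EQG:
# score each timestamped transcript segment by (a) lexical overlap
# with the target answer and (b) temporal proximity to the target
# timestamp, then keep the top-k segments as context.
# All names and weights are illustrative assumptions.

def score_segment(segment_text, segment_time, answer, target_time,
                  overlap_weight=1.0, time_weight=0.1):
    """Combine answer relevance (token overlap) with temporal proximity."""
    seg_tokens = set(segment_text.lower().split())
    ans_tokens = set(answer.lower().split())
    overlap = len(seg_tokens & ans_tokens) / max(len(ans_tokens), 1)
    proximity = 1.0 / (1.0 + abs(segment_time - target_time))  # decays with distance
    return overlap_weight * overlap + time_weight * proximity

def select_contexts(segments, answer, target_time, k=2):
    """segments: list of (text, timestamp_in_seconds). Return top-k texts."""
    ranked = sorted(
        segments,
        key=lambda s: score_segment(s[0], s[1], answer, target_time),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy lecture transcript (made-up example data).
transcript = [
    ("today we discuss sorting algorithms", 10.0),
    ("quicksort picks a pivot and partitions the array", 95.0),
    ("next week we cover graph traversal", 300.0),
]
picked = select_contexts(transcript, answer="quicksort pivot", target_time=100.0)
```

In the paper's full framework, the selected contexts from both modalities would then be passed to an LLM and rewritten into answer-containing knowledge statements before question generation; here `picked` simply holds the ranked context segments.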