🤖 AI Summary
This study addresses the limitations of current educational question generation methods, which often rely on subjective evaluation or closed-source large language models (LLMs) and lack systematic investigation into the constraints of real-world teaching scenarios regarding generation, evaluation, and deployment. It presents the first systematic comparison between small language models (SLMs) and LLMs for this task, introducing a reproducible, education-oriented evaluation framework grounded in Bloom’s taxonomy. Using expert-annotated data, the work analyzes the alignment between model-based automated scoring and human judgments. Findings indicate that SLMs are competitive with LLMs across key educational quality dimensions and offer advantages in privacy preservation and local deployment. However, automated evaluation exhibits systematic biases, underscoring the necessity of human-in-the-loop mechanisms to ensure reliability.
📝 Abstract
Generative AI increasingly supports educational design tasks: Large Language Models (LLMs), for example, can design assessment questions aligned with pedagogical frameworks such as Bloom's taxonomy. However, existing studies often rely on subjective or limited evaluation methods, focus primarily on proprietary models, and rarely examine generation, evaluation, and deployment constraints systematically in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations, yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations exhibit systematic inconsistencies and bias relative to expert ratings. These findings support positioning language models as bounded assistants in assessment workflows, underscore the necessity of Human-in-the-Loop oversight, and advance the field of automated educational question generation by examining quality, reliability, and deployment-aware trade-offs.
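To make the idea of comparing model-based judging with expert ratings concrete, the following is a minimal sketch of two common agreement checks: Cohen's kappa for chance-corrected agreement and a mean signed difference as a simple bias indicator. The abstract does not state which statistics the authors used, so the choice of metrics, the function names, and the toy ratings are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_signed_difference(model_scores, expert_scores):
    """Average of (model - expert); positive values mean the model scores higher."""
    if len(model_scores) != len(expert_scores) or not model_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [m - e for m, e in zip(model_scores, expert_scores)]
    return sum(diffs) / len(diffs)


# Hypothetical ratings on a 1-5 scale, invented purely for illustration.
model_judge = [4, 5, 3, 4, 5, 4]
expert = [3, 4, 3, 4, 4, 2]
kappa = cohen_kappa(model_judge, expert)
bias = mean_signed_difference(model_judge, expert)
```

A low kappa combined with a consistently positive signed difference would be one way a systematic leniency bias in a model judge could show up, which is the kind of pattern that motivates keeping expert review in the loop.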