🤖 AI Summary
Manual authoring of multiple-choice questions (MCQs) for K–12 morphological vocabulary assessment is costly and suffers from low inter-rater consistency. Method: We propose a structured composite prompting strategy tailored to small- and medium-scale language models, integrating chain-of-thought reasoning with stepwise task decomposition. Using a fine-tuned Gemma-2B as the base model, we systematically compare seven prompting variants and employ GPT-4.1 to simulate expert scoring for large-scale automated evaluation. Contribution/Results: Our approach significantly enhances Gemma-2B’s generation quality under low-resource conditions, achieving superior construct alignment and pedagogical appropriateness compared with zero-shot GPT-3.5 outputs. This work constitutes the first empirical validation of lightweight models augmented with structured prompting for educational assessment item generation, establishing a reproducible, scalable, and cost-effective paradigm for automated test item authoring.
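To make the composite prompting idea concrete, the sketch below shows how chain-of-thought reasoning and stepwise task decomposition might be combined into a single generation prompt for a small instruction-tuned model. The model ID, prompt wording, target suffix, and decoding settings are illustrative assumptions, not the paper's released materials.

```python
# Minimal sketch only: prompt wording and settings are assumptions, not the paper's prompts.
from transformers import pipeline

# Instruction-tuned Gemma-2B checkpoint (assumed; the paper fine-tunes its own variant,
# and downloading this model requires accepting its license on Hugging Face).
generator = pipeline("text-generation", model="google/gemma-2b-it")

# Composite prompt: role framing + chain-of-thought + stepwise task decomposition.
prompt = """You are an experienced K-12 language assessment writer.
Target skill: morphological awareness (the suffix "-less").

Think step by step, completing each step in order:
Step 1. State, in one sentence, the construct the item should measure.
Step 2. Write a grade-appropriate stem containing the target word.
Step 3. Write one correct answer (key) that requires analyzing the suffix.
Step 4. Write three plausible distractors reflecting common student errors.
Step 5. Output the finished multiple-choice question with options A-D.
"""

result = generator(prompt, max_new_tokens=400, do_sample=False)
print(result[0]["generated_text"])
```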
📝 Abstract
This study explores automatic item generation (AIG) with language models to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned mid-sized model (Gemma, 2B) with a larger, untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and their combinations. Generated items were assessed with automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-sized model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance mid-sized models for AIG under limited-data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K–12.
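The simulated expert scoring step can be sketched as a rubric-prompted call to a large model. The rubric dimensions, prompt wording, and scoring scale below are illustrative assumptions; the paper's actual five dimensions and its calibration of GPT-4.1 on expert-rated samples are not reproduced here.

```python
# Illustrative sketch only: rubric names and prompt text are placeholders, not the paper's rubric.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder scoring dimensions (the paper's five dimensions may differ).
RUBRIC = ["construct alignment", "distractor plausibility", "clarity",
          "grade appropriateness", "answer correctness"]

def score_item(item_text: str, rated_examples: str) -> dict:
    """Ask GPT-4.1 to rate one generated MCQ on each rubric dimension (1-5)."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system",
             "content": "You score K-12 vocabulary test items like a human expert. "
                        "Calibrate your judgments to these expert-rated examples:\n"
                        + rated_examples},
            {"role": "user",
             "content": f"Rate this item from 1 to 5 on each of {RUBRIC}. "
                        f"Reply as a JSON object mapping dimension to score.\n\n{item_text}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```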