Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual authoring of multiple-choice questions (MCQs) for K–12 morphological vocabulary assessment is costly and suffers from low inter-rater consistency. Method: We propose a structured composite prompting strategy tailored for small- to medium-scale language models, integrating chain-of-thought reasoning and stepwise task decomposition. Using fine-tuned Gemma-2B as the baseline, we systematically compare seven prompting variants and employ GPT-4.1 to simulate expert scoring for large-scale automated evaluation. Contribution/Results: Our approach significantly enhances Gemma-2B’s generation quality under low-resource conditions, achieving superior construct alignment and pedagogical appropriateness compared to zero-shot GPT-3.5 outputs. This work constitutes the first empirical validation of lightweight models augmented with structured prompting for educational assessment item generation, establishing a reproducible, scalable, and cost-effective paradigm for automated test item authoring.

📝 Abstract
This study explores automated item generation (AIG) using language models to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium-sized model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations thereof. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-sized model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance mid-sized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
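The composite prompting strategy described above (role framing plus chain-of-thought reasoning plus stepwise task decomposition) can be sketched as a simple template builder. This is an illustrative sketch only: the function name, step wording, and rubric are assumptions, not the authors' actual prompt templates.

```python
# Hypothetical sketch of a composite MCQ-generation prompt combining
# role-based framing, chain-of-thought, and sequential decomposition.
# All wording is illustrative; the paper's real templates may differ.

def build_composite_prompt(morpheme: str, grade: str) -> str:
    """Assemble a structured prompt for a small instruction-tuned model."""
    role = (
        f"You are a K-12 vocabulary assessment writer creating items "
        f"for grade {grade} students."
    )
    cot = "Think through each step before writing the final item."
    steps = [
        f"Step 1: Explain the meaning of the morpheme '{morpheme}'.",
        "Step 2: List three words containing the morpheme, with brief definitions.",
        "Step 3: Write one multiple-choice question testing the morpheme's "
        "meaning, with one key and three plausible distractors.",
        "Step 4: Verify that the key is unambiguous and that distractors "
        "reflect common student misconceptions, then output the final item.",
    ]
    return "\n".join([role, cot, *steps])

prompt = build_composite_prompt("bio", "5")
```

The resulting string would be sent to the small model (e.g., a fine-tuned Gemma-2B) in place of a flat zero-shot instruction; the sequential steps are what the abstract refers to as "stepwise task decomposition".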
Problem

Research questions and friction points this paper is trying to address.

Automatically generating multiple choice questions for K-12 education
Reducing cost and inconsistency of manual test development
Bridging performance gap between small and large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured prompting strategies for model enhancement
Combining chain-of-thought with sequential design
Fine-tuning midsized models under limited data
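The large-model scoring simulation described in the abstract (GPT-4.1 rating generated items across five dimensions) implies a judge-and-parse loop. A minimal sketch, assuming a plain "dimension: score" reply format and illustrative dimension names (the paper does not enumerate them on this page):

```python
# Hypothetical sketch of LLM-as-judge scoring: the generated item and a
# five-dimension rubric are sent to a large judge model, and its 1-5
# ratings are parsed from the reply. Dimension names are assumptions.
import re

DIMENSIONS = [
    "construct alignment",
    "clarity",
    "distractor quality",
    "grade appropriateness",
    "answer correctness",
]

def parse_scores(judge_reply: str) -> dict:
    """Extract 'dimension: score' ratings (1-5) from a judge reply."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{re.escape(dim)}\s*:\s*([1-5])", judge_reply, re.I)
        if m:
            scores[dim] = int(m.group(1))
    return scores

reply = (
    "construct alignment: 4\nclarity: 5\ndistractor quality: 3\n"
    "grade appropriateness: 4\nanswer correctness: 5"
)
scores = parse_scores(reply)
```

In the paper's workflow the judge model is first aligned with expert-rated samples, so parsed scores can approximate human ratings at a scale manual scoring cannot reach.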
Mohammad Amini
Mehre Alborz University
Generative AI · Natural Language Processing · Business Analytics · Machine Learning
Babak Ahmadi
Industrial and Systems Engineering Department, University of Florida, Gainesville, Florida, US
Xiaomeng Xiong
College of Education, University of Florida, Gainesville, Florida, US
Yilin Zhang
Michigan State University
Nanotechnology · Polymers · Sustainable Agriculture · Environmental Chemistry · Biopolymers
Christopher Qiao
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, Florida, US