🤖 AI Summary
This study investigates whether AI-generated educational assessments can match human-authored items in psychometric quality and user satisfaction.
Method: We built an automated quiz-generation pipeline for a Natural Language Processing (NLP) course using GPT-4o-mini and conducted the first integrated psychometric evaluation combining unidimensional and multidimensional Item Response Theory (IRT) with Differential Item Functioning (DIF) analysis. A mixed-methods assessment captured both student and domain-expert perspectives.
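The exact IRT parameterization is not spelled out here; as a point of reference, the two-parameter logistic (2PL) model is the standard unidimensional form in which the discrimination and difficulty parameters mentioned below appear:

$$
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}},
$$

where $\theta_j$ is examinee $j$'s latent ability, $a_i$ the discrimination of item $i$, and $b_i$ its difficulty. The multidimensional extension replaces $a_i(\theta_j - b_i)$ with $\mathbf{a}_i^\top \boldsymbol{\theta}_j + d_i$.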
Contribution/Results: LLM-generated items demonstrated strong discrimination and appropriate difficulty levels, with stable IRT parameter estimates. DIF analysis identified only two potentially biased items requiring review. Students and experts rated item quality, clarity, and pedagogical alignment highly (mean ≥4.3/5). This work establishes a methodological framework and empirical foundation for AI-powered, scalable, interpretable, and psychometrically sound educational assessment.
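For readers unfamiliar with uniform DIF screening, below is a minimal sketch of a Mantel-Haenszel check, assuming 0/1 scored responses and a binary group variable; the rest-score stratification and the function name are illustrative assumptions, not details taken from the study.

```python
import numpy as np
from scipy.stats import chi2

def mantel_haenszel_dif(responses, group, item, n_strata=5):
    """Continuity-corrected Mantel-Haenszel chi-square for uniform DIF on one item.

    responses : (n_examinees, n_items) array of 0/1 scored answers
    group     : length-n array, 0 = reference group, 1 = focal group
    item      : column index of the item under review
    """
    responses, group = np.asarray(responses), np.asarray(group)
    # Match examinees on ability via the rest score (total minus the studied item)
    rest = responses.sum(axis=1) - responses[:, item]
    edges = np.quantile(rest, np.linspace(0, 1, n_strata + 1))[1:-1]
    strata = np.digitize(rest, edges)

    obs = exp = var = 0.0
    for s in np.unique(strata):
        y = responses[strata == s, item]
        g = group[strata == s]
        a = np.sum((g == 0) & (y == 1))  # reference group, correct
        b = np.sum((g == 0) & (y == 0))  # reference group, incorrect
        c = np.sum((g == 1) & (y == 1))  # focal group, correct
        d = np.sum((g == 1) & (y == 0))  # focal group, incorrect
        t = a + b + c + d
        if t < 2:
            continue  # stratum too small to contribute
        obs += a
        exp += (a + b) * (a + c) / t
        var += (a + b) * (c + d) * (a + c) * (b + d) / (t**2 * (t - 1))

    stat = (abs(obs - exp) - 0.5) ** 2 / var
    return stat, chi2.sf(stat, df=1)  # flag the item for review if p is small
```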
📝 Abstract
This research presents an automated pipeline for generating reliable question-and-answer (Q&A) tests with AI chatbots. Using GPT-4o-mini, we automatically generated a Q&A test for a Natural Language Processing course and evaluated its psychometric properties and perceived quality with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check flagged two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
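As an illustration of what such a generation pipeline could look like, here is a minimal sketch using the official OpenAI Python SDK; the prompt wording, JSON schema, and helper name are assumptions for this example, not the study's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Write {n} multiple-choice questions on the NLP topic '{topic}'. "
    "Return a JSON object with one key 'items': an array of objects, each "
    "with keys 'question', 'options' (4 strings) and 'answer' (index 0-3)."
)

def generate_quiz(topic: str, n: int = 5) -> list[dict]:
    """Ask GPT-4o-mini for n multiple-choice items on a course topic."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(n=n, topic=topic)}],
        response_format={"type": "json_object"},  # request well-formed JSON
        temperature=0.7,
    )
    return json.loads(resp.choices[0].message.content)["items"]

quiz = generate_quiz("word embeddings")
```

Generated items would then be administered to students and scored, feeding the IRT and DIF analyses described above.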