Evaluating LLM-Generated Q&A Test: a Student-Centered Study

📅 2025-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether AI-generated educational assessments can match human-authored items in psychometric quality and user satisfaction. Method: We developed an automated NLP course quiz generation pipeline using GPT-4o-mini and conducted the first integrated psychometric evaluation combining unidimensional and multidimensional Item Response Theory (IRT) with Differential Item Functioning (DIF) analysis. A mixed-methods assessment was performed from both student and domain-expert perspectives. Contribution/Results: LLM-generated items demonstrated strong discrimination and appropriate difficulty levels, with stable IRT parameter estimates. DIF analysis identified only two potentially biased items requiring review. Students and experts rated item quality, clarity, and pedagogical alignment highly (mean ≥4.3/5). This work establishes a methodological framework and empirical foundation for AI-powered, scalable, interpretable, and psychometrically sound educational assessment.

📝 Abstract
This research presents an automated pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric properties and perceived quality with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
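The IRT analysis reported above characterizes each item by a discrimination and a difficulty parameter. As a minimal sketch of the idea (not the paper's actual pipeline, which uses a mixed-format model with jointly estimated abilities), a two-parameter logistic (2PL) item can be fit by maximum likelihood with `scipy`, assuming examinee abilities are known and responses are simulated:

```python
# Minimal 2PL fit for one dichotomous item: P(correct) = sigmoid(a * (theta - b)).
# Abilities theta are assumed known here; production analyses (e.g. R's mirt)
# estimate abilities and item parameters jointly via marginal maximum likelihood.
import numpy as np
from scipy.optimize import minimize

def fit_2pl_item(theta, resp):
    """Estimate discrimination a and difficulty b by maximizing the
    Bernoulli log-likelihood of the observed 0/1 responses."""
    def neg_log_lik(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        p = np.clip(p, 1e-9, 1 - 1e-9)  # guard against log(0)
        return -np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p))
    res = minimize(neg_log_lik, x0=[1.0, 0.0], method="Nelder-Mead")
    return res.x  # (a_hat, b_hat)

# Simulated data: 5000 examinees answering one item with a=1.5, b=0.5.
rng = np.random.default_rng(42)
theta = rng.normal(size=5000)
true_a, true_b = 1.5, 0.5
p = 1.0 / (1.0 + np.exp(-true_a * (theta - true_b)))
resp = (rng.random(5000) < p).astype(int)
a_hat, b_hat = fit_2pl_item(theta, resp)
```

With a sample this size the recovered parameters land close to the simulated values, which is the "stable IRT parameter estimates" property the summary refers to.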
Problem

Research questions and friction points this paper is trying to address.

Develops AI pipeline for reliable Q&A test generation
Evaluates GPT-4o-mini test quality with students and experts
Compares LLM-generated and human-authored test performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated GPT-4o-mini Q&A test generation
Mixed-format IRT analysis for quality metrics
Uniform DIF check for item review
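The uniform DIF check flags items that behave differently for two examinee groups of equal ability. The paper's exact procedure is not reproduced here; a common screen, sketched below on simulated data, is the Mantel-Haenszel common odds ratio computed across score strata, where a DIF-free item yields an odds ratio near 1:

```python
# Sketch of a uniform DIF screen via the Mantel-Haenszel common odds ratio.
# Groups, item difficulties, and the rest-score matching variable are
# illustrative assumptions, not values from the study.
import numpy as np

def mantel_haenszel_odds_ratio(item, group, matching):
    """Common odds ratio alpha_MH over strata of the matching variable.

    item     : 0/1 responses to the studied item
    group    : 0 = reference group, 1 = focal group
    matching : ability proxy used for stratification (e.g. rest score)
    """
    num = den = 0.0
    for s in np.unique(matching):
        m = matching == s
        a = np.sum((item == 1) & (group == 0) & m)  # reference, correct
        b = np.sum((item == 0) & (group == 0) & m)  # reference, incorrect
        c = np.sum((item == 1) & (group == 1) & m)  # focal, correct
        d = np.sum((item == 0) & (group == 1) & m)  # focal, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("nan")

# Simulate a 5-item DIF-free test: both groups share the ability distribution.
rng = np.random.default_rng(0)
n = 2000
theta = rng.normal(size=n)
group = rng.integers(0, 2, size=n)
diffs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])       # Rasch-style difficulties
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - diffs)))
resp = (rng.random((n, 5)) < p).astype(int)
rest = resp[:, 1:].sum(axis=1)                       # rest score for item 0
or_hat = mantel_haenszel_odds_ratio(resp[:, 0], group, rest)
```

Items whose odds ratio departs substantially from 1 (by a threshold such as the ETS A/B/C classification) are the ones flagged for human review, as with the two items identified in this study.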
🔎 Similar Papers
No similar papers found.
Anna Wróblewska
Warsaw University of Technology
machine learning, natural language processing, image processing, multimodal learning, recommendation
Bartosz Grabek
Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland
Jakub Świstak
Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland
Daniel Dan
Assistant Professor, Modul University, Vienna
Artificial Intelligence, Applied Data Science, Marketing, Tourism, Demography