🤖 AI Summary
This study investigates whether AI-generated educational assessments can match human-authored items in psychometric quality and user satisfaction.
Method: We built an automated quiz-generation pipeline for a Natural Language Processing (NLP) course using GPT-4o-mini and conducted the first integrated psychometric evaluation combining unidimensional and multidimensional Item Response Theory (IRT) with Differential Item Functioning (DIF) analysis. A mixed-methods assessment captured both student and domain-expert perspectives.
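The exact IRT parameterization is not spelled out here; as a point of reference, the two-parameter logistic (2PL) model is the standard unidimensional form in which the discrimination and difficulty parameters mentioned below appear:

$$
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}},
$$

where $\theta_j$ is examinee $j$'s latent ability, $a_i$ the discrimination of item $i$, and $b_i$ its difficulty. The multidimensional extension replaces $a_i(\theta_j - b_i)$ with $\mathbf{a}_i^\top \boldsymbol{\theta}_j + d_i$.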
Contribution/Results: LLM-generated items demonstrated strong discrimination and appropriate difficulty levels, with stable IRT parameter estimates. DIF analysis identified only two potentially biased items requiring review. Students and experts rated item quality, clarity, and pedagogical alignment highly (mean ≥4.3/5). This work establishes a methodological framework and empirical foundation for AI-powered, scalable, interpretable, and psychometrically sound educational assessment.
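For readers unfamiliar with uniform DIF screening, below is a minimal sketch of a Mantel-Haenszel check, assuming 0/1 scored responses and a binary group variable; the rest-score stratification and the function name are illustrative assumptions, not details taken from the study.

```python
import numpy as np
from scipy.stats import chi2

def mantel_haenszel_dif(responses, group, item, n_strata=5):
    """Continuity-corrected Mantel-Haenszel chi-square for uniform DIF on one item.

    responses : (n_examinees, n_items) array of 0/1 scored answers
    group     : length-n array, 0 = reference group, 1 = focal group
    item      : column index of the item under review
    """
    responses, group = np.asarray(responses), np.asarray(group)
    # Match examinees on ability via the rest score (total minus the studied item)
    rest = responses.sum(axis=1) - responses[:, item]
    edges = np.quantile(rest, np.linspace(0, 1, n_strata + 1))[1:-1]
    strata = np.digitize(rest, edges)

    obs = exp = var = 0.0
    for s in np.unique(strata):
        y = responses[strata == s, item]
        g = group[strata == s]
        a = np.sum((g == 0) & (y == 1))  # reference group, correct
        b = np.sum((g == 0) & (y == 0))  # reference group, incorrect
        c = np.sum((g == 1) & (y == 1))  # focal group, correct
        d = np.sum((g == 1) & (y == 0))  # focal group, incorrect
        t = a + b + c + d
        if t < 2:
            continue  # stratum too small to contribute
        obs += a
        exp += (a + b) * (a + c) / t
        var += (a + b) * (c + d) * (a + c) * (b + d) / (t**2 * (t - 1))

    stat = (abs(obs - exp) - 0.5) ** 2 / var
    return stat, chi2.sf(stat, df=1)  # flag the item for review if p is small
```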
📝 Abstract
This research presents an automated pipeline for generating reliable question-and-answer (Q&A) tests with AI chatbots. Using GPT-4o-mini, we automatically generated a Q&A test for a Natural Language Processing course and evaluated its psychometric properties and perceived quality with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check flagged two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
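As an illustration of what such a generation pipeline could look like, here is a minimal sketch using the official OpenAI Python SDK; the prompt wording, JSON schema, and helper name are assumptions for this example, not the study's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Write {n} multiple-choice questions on the NLP topic '{topic}'. "
    "Return a JSON object with one key 'items': an array of objects, each "
    "with keys 'question', 'options' (4 strings) and 'answer' (index 0-3)."
)

def generate_quiz(topic: str, n: int = 5) -> list[dict]:
    """Ask GPT-4o-mini for n multiple-choice items on a course topic."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(n=n, topic=topic)}],
        response_format={"type": "json_object"},  # request well-formed JSON
        temperature=0.7,
    )
    return json.loads(resp.choices[0].message.content)["items"]

quiz = generate_quiz("word embeddings")
```

Generated items would then be administered to students and scored, feeding the IRT and DIF analyses described above.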