🤖 AI Summary
This study investigates whether large language models (LLMs) can serve as psychometrically valid surrogates for human participants in educational test pretesting. Method: Using multiple-choice item datasets covering reading comprehension, U.S. history, and economics, we systematically evaluate 18 instruction-tuned LLMs under both Classical Test Theory (CTT) and Item Response Theory (IRT) frameworks, and introduce temperature scaling for response calibration. Results: After temperature calibration, larger LLMs exhibit response distributions significantly closer to human benchmarks; reading comprehension items yield the highest human–LLM response correlations, though overall correlations remain modest, and zero-shot LLM responses are not yet viable for operational pretesting. Contribution: We establish the first dual-theoretical (CTT + IRT) evaluation paradigm for assessing the psychometric plausibility of LLM responses in educational assessment, and empirically validate temperature scaling as a critical calibration mechanism for improving alignment with human response patterns.
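As a rough illustration of the calibration step mentioned above, the sketch below applies temperature scaling to an LLM's per-option log-probabilities to obtain a softened choice distribution. The function name, example log-probabilities, and temperature values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def scaled_option_distribution(option_logprobs, temperature=1.0):
    """Turn per-option log-probabilities into a choice distribution,
    softened (T > 1) or sharpened (T < 1) by a temperature parameter."""
    logits = np.asarray(option_logprobs, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Illustrative log-probabilities for options A-D of one multiple-choice item.
logprobs = [-0.2, -2.5, -3.0, -3.3]

print(scaled_option_distribution(logprobs, temperature=1.0))  # over-confident model
print(scaled_option_distribution(logprobs, temperature=2.5))  # flatter, closer to a human group
```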
📝 Abstract
Knowing how test takers answer items in educational assessments is essential for developing tests, evaluating item quality, and improving test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior on test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs using two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics that are commonly used in educational assessment: classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can become more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans on reading comprehension items than on items from the other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
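For readers less familiar with the two psychometric frameworks named in the abstract, the following sketch computes the standard quantities they compare: item difficulty as the proportion of correct responses under classical test theory, and the two-parameter logistic (2PL) item characteristic curve under item response theory. The scores and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def ctt_difficulty(scores):
    """Classical test theory: item difficulty (p-value) is the proportion of
    correct responses on 0/1-scored items; higher values mean easier items."""
    return np.mean(scores)

def irt_2pl(theta, a, b):
    """Item response theory (2PL): probability that a test taker with ability
    theta answers correctly, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative 0/1 scores from ten respondents (human or LLM) on one item.
scores = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print("CTT difficulty (p-value):", ctt_difficulty(scores))

# Probability of a correct answer across a range of abilities for one item.
abilities = np.linspace(-3, 3, 7)
print("2PL curve:", irt_2pl(abilities, a=1.2, b=0.0).round(2))
```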