🤖 AI Summary
This study addresses the reliability limitations of human evaluation in AI model assessment, which is often compromised by rater biases such as severity and central tendency. To mitigate these issues, the work introduces the multi-faceted Rasch model, a psychometric framework grounded in item response theory, into the human evaluation pipeline for AI-generated outputs. The approach explicitly models rater effects, disentangling true output quality from systematic scoring bias. Empirical validation on the OpenAI summarization dataset shows that the method corrects for rater bias, improving the construct validity and transparency of evaluation. It also yields more accurate quality estimates and diagnostic insight into individual rater performance, providing a sounder basis for decisions about AI output quality.
📝 Abstract
Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. We review two common rater effects, severity and centrality, that distort observed ratings, and demonstrate how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers a more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
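For context, below is a minimal sketch of a multi-faceted (many-facet) Rasch rating scale model of the kind the abstract describes; the notation is illustrative and is not taken from the paper itself.

```latex
% Minimal sketch of a multi-faceted (many-facet) Rasch rating scale model.
% Notation is illustrative, not the paper's own:
%   P_{njk} : probability that rater j assigns category k to output n
%   theta_n : latent quality of AI output n
%   alpha_j : severity of rater j (larger = harsher)
%   tau_k   : threshold for moving from category k-1 to category k
\[
  \log \frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \tau_k
\]
% Jointly estimating theta_n and alpha_j yields quality estimates for each
% output that are adjusted for how severe or lenient each rater is.
```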