PATCH - Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency

📅 2024-04-02
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Existing LLM evaluation benchmarks suffer from three limitations: questionable measurement validity, unquantified item-level quality, and an ill-defined human reference population. To address these, the paper proposes PATCH, a psychometrically grounded evaluation framework that applies Item Response Theory (IRT) to place LLM ability estimates on the same scale as human norms. Using 8th-grade mathematics as a case study, PATCH estimates the proficiency of GPT-4 and Gemini-Pro-Vision against 56 human populations. Key contributions: (1) the PATCH framework itself, which addresses the limitations above; (2) an implementation showing that psychometrically derived ability estimates diverge substantially from conventional accuracy-based rankings; and (3) the release of four high-quality datasets for measuring and comparing LLM proficiency in grade-school mathematics and science against human populations, with psychometric item parameters such as difficulty and discrimination shown to matter for trustworthy, interpretable (multimodal) model assessment.

📝 Abstract
Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs' academic proficiency, often also with an interest in comparing model performance with human test takers. While these benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., do they measure what they are supposed to, in a reliable way?), lack of quality assessment on the item level (e.g., are some items more important or difficult than others?), and unclear human population reference (e.g., to whom can the model be compared?). In response to these challenges, we propose bringing knowledge from psychometrics - a field dedicated to the measurement of latent variables like academic proficiency - into LLM benchmarking. We make three primary contributions. First, we introduce PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned limitations, presenting a new direction for LLM benchmark research. Second, we implement PATCH by measuring GPT-4 and Gemini-Pro-Vision's proficiency in 8th-grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on existing benchmarking practices. Third, we release 4 high-quality datasets to support measuring and comparing LLM proficiency in grade school mathematics and science against human populations.
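At the core of the IRT approach the abstract describes is the idea of scoring a test taker (human or LLM) by estimating a latent ability from graded item responses, rather than by raw accuracy. A minimal sketch of a two-parameter logistic (2PL) model with grid-search maximum-likelihood ability estimation is below; the item parameters and response pattern are hypothetical illustrations, not values from the paper's datasets.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response
    given ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, lo=-4.0, hi=4.0, steps=2000):
    """Maximum-likelihood ability estimate via a simple grid search
    over theta in [lo, hi], given known item parameters."""
    best_theta, best_ll = lo, -float("inf")
    for i in range(steps + 1):
        theta = lo + (hi - lo) * i / steps
        ll = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if x == 1 else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical calibrated items: (discrimination a, difficulty b).
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]
responses = [1, 1, 1, 0]  # correct on the three easier items only
theta_hat = estimate_ability(responses, items)
```

Because ability estimates weight items by difficulty and discrimination, two test takers with identical accuracy can receive different theta estimates, which is one mechanism behind the divergence from accuracy-based rankings the paper reports.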
Problem

Research questions and friction points this paper is trying to address.

Improving measurement quality in LLM benchmarking
Enabling valid LLM-human performance comparisons
Addressing item-level quality assessment limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Psychometrics-assisted benchmarking framework PATCH
Valid LLM-human population comparison
High-quality datasets for measuring LLM proficiency against human populations
Qixiang Fang
Department of Methodology & Statistics, Utrecht University
D. Oberski
Department of Methodology & Statistics, Utrecht University
Dong Nguyen
Department of Computing Sciences, Utrecht University