🤖 AI Summary
Problem: Conventional LLM medical evaluation relies solely on aggregate accuracy, obscuring fine-grained differences in model proficiency across specific medical domains.
Method: We propose MedIRT—the first systematic application of unidimensional two-parameter logistic Item Response Theory (2PL-IRT) to assess LLMs’ medical competencies. Leveraging response data from 80 models on 1,100 USMLE-aligned questions, MedIRT jointly estimates model ability, item difficulty, and item discrimination.
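For reference, the standard unidimensional 2PL model underlying this approach expresses the probability that model $i$ answers item $j$ correctly as a logistic function of three parameters (the notation below is conventional 2PL notation, not necessarily the paper's own symbols):

```latex
P(X_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + \exp\{-a_j(\theta_i - b_j)\}}
```

where $\theta_i$ is the latent ability of model $i$, $b_j$ the difficulty of item $j$, and $a_j$ its discrimination, i.e., how sharply the item separates stronger from weaker models.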
Contribution/Results: We discover distinctive "spiky" ability profiles, revealing that global rankings mask domain-specific strengths. Building on this, we construct multi-factor competency profiles and a decision-support framework. Experiments demonstrate MedIRT's robustness in model ranking (e.g., GPT-5 leads in 8 of 11 domains, while the overall 23rd-ranked Claude-3-opus surpasses peers in Social Science and Communication) and its efficacy in identifying flawed items, enhancing assessment reliability and granularity, particularly in high-stakes clinical scenarios.
📝 Abstract
As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, a critical need has emerged for reliable and accurate evaluation methodologies. Traditional accuracy metrics are inadequate, as they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
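To make the estimation step concrete, here is a minimal, self-contained sketch of joint maximum-likelihood fitting for a unidimensional 2PL model on a binary response matrix. It is illustrative only: the simulated data, the weak prior on ability used to pin down the latent scale, and all variable names are our assumptions, not the paper's actual pipeline (which fits one 2PL model per topic to real LLM responses).

```python
# Minimal 2PL IRT sketch: jointly fit model abilities (theta), item
# difficulties (b), and item discriminations (a) by penalized maximum
# likelihood. All data below are simulated for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
n_models, n_items = 20, 50

# Simulate ground-truth parameters and a 0/1 response matrix (hypothetical).
theta_true = rng.normal(0.0, 1.0, n_models)      # latent abilities
b_true = rng.normal(0.0, 1.0, n_items)           # item difficulties
a_true = rng.lognormal(0.0, 0.3, n_items)        # item discriminations (> 0)
R = rng.binomial(1, expit(a_true * (theta_true[:, None] - b_true)))

def unpack(x):
    theta = x[:n_models]
    b = x[n_models:n_models + n_items]
    a = np.exp(x[n_models + n_items:])           # log-parameterization keeps a > 0
    return theta, b, a

def penalized_nll(x):
    theta, b, a = unpack(x)
    logits = a * (theta[:, None] - b)            # a_j * (theta_i - b_j)
    # Bernoulli log-likelihood; logaddexp(0, z) = log(1 + e^z) is stable.
    ll = np.sum(R * logits - np.logaddexp(0.0, logits))
    # Weak N(0, 1) prior on theta pins down the otherwise unidentified
    # location/scale of the latent trait.
    return -(ll - 0.5 * np.sum(theta ** 2))

x0 = np.zeros(n_models + 2 * n_items)
fit = minimize(penalized_nll, x0, method="L-BFGS-B")
theta_hat, b_hat, a_hat = unpack(fit.x)

print("ability corr. with truth:",
      np.corrcoef(theta_hat, theta_true)[0, 1].round(3))
print("items with near-zero discrimination:", np.sum(a_hat < 0.2))
```

In a setup like this, items whose estimated discrimination comes out near zero (or negative, if left unconstrained) barely separate strong from weak models; that is the kind of signal the paper's benchmark-auditing step uses to flag flawed questions.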