🤖 AI Summary
Problem: Conventional LLM medical evaluation relies solely on aggregate accuracy, obscuring fine-grained differences in model proficiency across specific medical domains.
Method: We propose MedIRT—the first systematic application of unidimensional two-parameter logistic Item Response Theory (2PL-IRT) to assess LLMs’ medical competencies. Leveraging response data from 80 models on 1,100 USMLE-aligned questions, MedIRT jointly estimates model ability, item difficulty, and item discrimination.
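For reference, the standard unidimensional 2PL model underlying this approach expresses the probability that model $i$ answers item $j$ correctly as a logistic function of three parameters (the notation below is conventional 2PL notation, not necessarily the paper's own symbols):

```latex
P(X_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + \exp\{-a_j(\theta_i - b_j)\}}
```

where $\theta_i$ is the latent ability of model $i$, $b_j$ the difficulty of item $j$, and $a_j$ its discrimination, i.e., how sharply the item separates stronger from weaker models.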
Contribution/Results: We discover distinctive "spiky" ability profiles, revealing that global rankings mask domain-specific strengths. Building on this, we construct multi-factor competency profiles and a decision-support framework. Experiments demonstrate MedIRT's robustness in model ranking (e.g., GPT-5 leads in 8 of 11 domains, while the overall 23rd-ranked Claude-3-opus surpasses peers in Social Science and Communication) and its efficacy in identifying flawed items, enhancing assessment reliability and granularity, particularly in high-stakes clinical scenarios.
📝 Abstract
As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, a critical need has emerged for reliable and accurate evaluation methodologies. Traditional accuracy metrics are inadequate, as they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
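To make the estimation step concrete, here is a minimal, self-contained sketch of joint maximum-likelihood fitting for a unidimensional 2PL model on a binary response matrix. It is illustrative only: the simulated data, the weak prior on ability used to pin down the latent scale, and all variable names are our assumptions, not the paper's actual pipeline (which fits one 2PL model per topic to real LLM responses).

```python
# Minimal 2PL IRT sketch: jointly fit model abilities (theta), item
# difficulties (b), and item discriminations (a) by penalized maximum
# likelihood. All data below are simulated for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
n_models, n_items = 20, 50

# Simulate ground-truth parameters and a 0/1 response matrix (hypothetical).
theta_true = rng.normal(0.0, 1.0, n_models)      # latent abilities
b_true = rng.normal(0.0, 1.0, n_items)           # item difficulties
a_true = rng.lognormal(0.0, 0.3, n_items)        # item discriminations (> 0)
R = rng.binomial(1, expit(a_true * (theta_true[:, None] - b_true)))

def unpack(x):
    theta = x[:n_models]
    b = x[n_models:n_models + n_items]
    a = np.exp(x[n_models + n_items:])           # log-parameterization keeps a > 0
    return theta, b, a

def penalized_nll(x):
    theta, b, a = unpack(x)
    logits = a * (theta[:, None] - b)            # a_j * (theta_i - b_j)
    # Bernoulli log-likelihood; logaddexp(0, z) = log(1 + e^z) is stable.
    ll = np.sum(R * logits - np.logaddexp(0.0, logits))
    # Weak N(0, 1) prior on theta pins down the otherwise unidentified
    # location/scale of the latent trait.
    return -(ll - 0.5 * np.sum(theta ** 2))

x0 = np.zeros(n_models + 2 * n_items)
fit = minimize(penalized_nll, x0, method="L-BFGS-B")
theta_hat, b_hat, a_hat = unpack(fit.x)

print("ability corr. with truth:",
      np.corrcoef(theta_hat, theta_true)[0, 1].round(3))
print("items with near-zero discrimination:", np.sum(a_hat < 0.2))
```

In a setup like this, items whose estimated discrimination comes out near zero (or negative, if left unconstrained) barely separate strong from weak models; that is the kind of signal the paper's benchmark-auditing step uses to flag flawed questions.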