Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the narrative medicine reasoning capabilities of large language models (LLMs) in rare disease diagnosis. Addressing the absence of standardized, educationally grounded evaluation benchmarks, we introduce the first pedagogically validated benchmark for rare disease diagnosis: a high-quality, narrative-based dataset comprising 176 clinical cases derived from *House M.D.* episodes. We systematically evaluate state-of-the-art models, including GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro, on this benchmark, observing diagnostic accuracy ranging from 16.48% to 38.64%. Notably, next-generation models achieve up to a 2.3× improvement over their predecessors, suggesting that architectural advances substantially enhance medical reasoning. The benchmark is fully open-sourced and reproducible, establishing the first publicly available, rigorously designed, education-oriented evaluation framework for AI-assisted rare disease diagnosis.

📝 Abstract
Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, with accuracy ranging from 16.48% to 38.64% and newer model generations demonstrating up to a 2.3× improvement over their predecessors. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' rare disease diagnostic accuracy
Assessing narrative medical case reasoning capabilities
Establishing benchmark for AI-assisted diagnosis research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used House M.D. TV series dataset
Evaluated four state-of-the-art LLMs
Established educationally validated benchmark framework
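The benchmark described above scores models on how often their predicted diagnosis matches the gold diagnosis for each narrative case. The paper does not specify its exact scoring protocol, so the sketch below is a minimal illustration of one plausible setup: an exact-match accuracy loop over (narrative, diagnosis) pairs, where `predict`, `normalize`, and the matching rule are all assumptions, not the authors' implementation.

```python
def normalize(diagnosis: str) -> str:
    """Lowercase and strip punctuation for lenient string matching (assumed rule)."""
    return "".join(ch for ch in diagnosis.lower() if ch.isalnum() or ch == " ").strip()

def evaluate(cases, predict):
    """Compute accuracy of `predict` over (symptom_narrative, gold_diagnosis) pairs.

    `predict` is any callable mapping a narrative string to a predicted
    diagnosis string, e.g. a thin wrapper around an LLM API call.
    """
    correct = sum(
        normalize(predict(narrative)) == normalize(gold)
        for narrative, gold in cases
    )
    return correct / len(cases)

# Toy illustration with two made-up cases and a stubbed "model":
cases = [
    ("fever, night sweats, recent travel history", "malaria"),
    ("heliotrope rash, proximal muscle weakness", "dermatomyositis"),
]
stub_model = lambda narrative: "Malaria." if "fever" in narrative else "lupus"
accuracy = evaluate(cases, stub_model)  # 0.5: only the first case matches
```

Exact-match scoring is deliberately strict; a real harness might instead use synonym lists or an LLM judge to credit clinically equivalent answers.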
Arsh Gupta
The Pennsylvania State University
Ajay Narayanan Sridhar
The Pennsylvania State University
Bonam Mingole
The Pennsylvania State University
Amulya Yadav
Assistant Professor, Penn State
Research interests: Multi-Agent Sequential Decision Making, Social Networks, Influence Maximization, Machine Learning, Game Theory