Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the narrative medicine reasoning capabilities of large language models (LLMs) in rare disease diagnosis. Addressing the absence of standardized, educationally grounded evaluation benchmarks, we introduce the first pedagogically validated benchmark for rare disease diagnosis: a high-quality, narrative-based dataset comprising 176 clinical cases derived from *House M.D.* episodes. We systematically evaluate state-of-the-art models, including GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro, on this benchmark, observing diagnostic accuracy ranging from 16.48% to 38.64%. Notably, next-generation models achieve up to a 2.3× improvement over their predecessors, suggesting that architectural advances substantially enhance medical reasoning. The benchmark is fully open-sourced and reproducible, establishing the first publicly available, rigorously designed, education-oriented evaluation framework for AI-assisted rare disease diagnosis.

📝 Abstract
Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, with accuracy ranging from 16.48% to 38.64% and newer model generations demonstrating up to a 2.3× improvement over their predecessors. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' rare disease diagnostic accuracy
Assessing narrative medical case reasoning capabilities
Establishing benchmark for AI-assisted diagnosis research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used House M.D. TV series dataset
Evaluated four state-of-the-art LLMs
Established educationally validated benchmark framework
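The benchmark described above scores models on how often their predicted diagnosis matches the gold diagnosis for each narrative case. The paper does not specify its exact scoring protocol, so the sketch below is a minimal illustration of one plausible setup: an exact-match accuracy loop over (narrative, diagnosis) pairs, where `predict`, `normalize`, and the matching rule are all assumptions, not the authors' implementation.

```python
def normalize(diagnosis: str) -> str:
    """Lowercase and strip punctuation for lenient string matching (assumed rule)."""
    return "".join(ch for ch in diagnosis.lower() if ch.isalnum() or ch == " ").strip()

def evaluate(cases, predict):
    """Compute accuracy of `predict` over (symptom_narrative, gold_diagnosis) pairs.

    `predict` is any callable mapping a narrative string to a predicted
    diagnosis string, e.g. a thin wrapper around an LLM API call.
    """
    correct = sum(
        normalize(predict(narrative)) == normalize(gold)
        for narrative, gold in cases
    )
    return correct / len(cases)

# Toy illustration with two made-up cases and a stubbed "model":
cases = [
    ("fever, night sweats, recent travel history", "malaria"),
    ("heliotrope rash, proximal muscle weakness", "dermatomyositis"),
]
stub_model = lambda narrative: "Malaria." if "fever" in narrative else "lupus"
accuracy = evaluate(cases, stub_model)  # 0.5: only the first case matches
```

Exact-match scoring is deliberately strict; a real harness might instead use synonym lists or an LLM judge to credit clinically equivalent answers.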
Arsh Gupta
The Pennsylvania State University
Ajay Narayanan Sridhar
The Pennsylvania State University
Bonam Mingole
The Pennsylvania State University
Amulya Yadav
Assistant Professor, Penn State
Research interests: Multi-Agent Sequential Decision Making, Social Networks, Influence Maximization, Machine Learning, Game Theory