🤖 AI Summary
This study investigates how well large language models (LLMs) reason over narrative medical cases to diagnose rare diseases. To address the absence of standardized, educationally grounded evaluation benchmarks, the authors introduce what they describe as the first pedagogically validated benchmark for rare disease diagnosis: a narrative-based dataset of 176 clinical cases derived from *House M.D.* episodes. Four state-of-the-art models (GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) are evaluated on this benchmark, with diagnostic accuracy ranging from 16.48% to 38.64%. Newer model generations achieve up to a 2.3× improvement over their predecessors, which suggests that advances across model generations improve medical reasoning, although every model still misdiagnoses most cases. The benchmark is open-sourced and reproducible, providing a publicly available, education-oriented evaluation framework for AI-assisted rare disease diagnosis.
📝 Abstract
Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show wide variation in performance, with accuracy ranging from 16.48% to 38.64% and newer model generations demonstrating a 2.3-fold improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
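The reported figures can be sanity-checked with a little arithmetic. The sketch below is not the authors' evaluation code; it assumes accuracy is simply correct diagnoses divided by the 176 cases, and the helper names `accuracy_pct` and `implied_correct` are hypothetical. Under that assumption, the raw counts of correct cases are inferred from the published percentages rather than taken from the paper.

```python
TOTAL_CASES = 176  # dataset size stated in the abstract


def accuracy_pct(correct: int, total: int = TOTAL_CASES) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / total, 2)


def implied_correct(pct: float, total: int = TOTAL_CASES) -> int:
    """Whole number of correct cases implied by a reported percentage.

    Assumption: accuracy = correct / total, with no partial credit.
    """
    return round(pct / 100 * total)


# Counts inferred from the reported range (not stated in the source).
lowest = implied_correct(16.48)
highest = implied_correct(38.64)

# Ratio between the best and worst reported accuracy.
ratio = highest / lowest

print(f"lowest:  {lowest}/{TOTAL_CASES} = {accuracy_pct(lowest)}%")
print(f"highest: {highest}/{TOTAL_CASES} = {accuracy_pct(highest)}%")
print(f"ratio:   {ratio:.2f}x")
```

If the simple-accuracy assumption holds, the percentages round-trip exactly to whole case counts, and the ratio between the highest and lowest accuracy is consistent with the reported 2.3-fold improvement.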