LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions

📅 2025-03-13

🤖 AI Summary

How accurately large language models (LLMs) diagnose chronic diseases across different medical specialties has not been systematically measured. Method: This study evaluates DeepSeek-R1 and O3 Mini on multi-specialty chronic disease diagnosis, leveraging structured symptom-to-diagnosis data, zero-shot prompting, and clinical knowledge alignment to quantify disease-level and category-level accuracy as well as the reliability of prediction confidence. Contribution/Results: DeepSeek-R1 achieves 82% overall accuracy (76% at the disease level), versus 75% and 72% for O3 Mini. DeepSeek-R1 reaches 100% in mental health, neurological disorders, and oncology but only 40% in respiratory disease; O3 Mini reaches 100% in autoimmune disease but just 20% in respiratory disease. The study also analyzes confidence scores (DeepSeek-R1 gives high-confidence predictions in 92% of cases, O3 Mini in 68%) and discusses bias, model interpretability, and data privacy. The results expose domain-specific strengths and limitations of each model and offer a reference point for clinical LLM deployment.

📝 Abstract

Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.
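
The two accuracy levels named in the abstract can be made concrete with a short sketch. The paper's scoring code and dataset schema are not shown here, so the `Case` record, its field names, and the case-insensitive exact-match rule are assumptions for illustration only. The abstract does not define "overall accuracy", so the sketch does not compute it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One evaluated case (hypothetical schema, not the paper's)."""
    true_disease: str
    true_category: str
    pred_disease: str
    pred_category: str


def _norm(label: str) -> str:
    # Assumed matching rule: compare labels ignoring case and outer whitespace.
    return label.strip().lower()


def disease_accuracy(cases):
    """Fraction of cases where the predicted disease matches the true disease."""
    if not cases:
        raise ValueError("no cases to score")
    hits = sum(_norm(c.true_disease) == _norm(c.pred_disease) for c in cases)
    return hits / len(cases)


def category_accuracy(cases):
    """Fraction of cases where the predicted category matches the true category."""
    if not cases:
        raise ValueError("no cases to score")
    hits = sum(_norm(c.true_category) == _norm(c.pred_category) for c in cases)
    return hits / len(cases)


def per_category_disease_accuracy(cases):
    """Disease-level accuracy broken down by the true category of each case."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for c in cases:
        totals[c.true_category] += 1
        hits[c.true_category] += int(_norm(c.true_disease) == _norm(c.pred_disease))
    return {cat: hits[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    # Invented toy cases, not data from the study.
    toy = [
        Case("Asthma", "Respiratory", "COPD", "Respiratory"),
        Case("Asthma", "Respiratory", "asthma", "Respiratory"),
        Case("Lupus", "Autoimmune", "Lupus", "Autoimmune"),
        Case("Migraine", "Neurological", "Depression", "Mental Health"),
    ]
    print(disease_accuracy(toy), category_accuracy(toy))
    print(per_category_disease_accuracy(toy))
```

A per-category breakdown like this is what would surface a pattern such as the reported gap between Respiratory Disease and the other categories. A prediction can be right at the category level while wrong at the disease level, which is why the two numbers differ.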

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based diagnostic tools for disease classification accuracy.

Assessing reliability of confidence scores in diagnostic predictions.

Addressing ethical concerns in LLM integration into clinical practice.
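
For the second question above, one simple way to check whether confidence scores are reliable is to compare how often each confidence level is used with how often predictions at that level are correct. The sketch below assumes each prediction carries a categorical confidence label such as "high" or "low"; the abstract does not say how confidence was elicited or scored, so the record format and the function name `confidence_report` are hypothetical.

```python
from collections import Counter


def confidence_report(records):
    """Summarise confidence usage and accuracy per confidence label.

    records: iterable of (confidence_label, is_correct) pairs.
    Returns {label: {"share": fraction of all cases, "accuracy": fraction correct}}.
    """
    totals = Counter()
    hits = Counter()
    for label, is_correct in records:
        totals[label] += 1
        hits[label] += int(bool(is_correct))

    n = sum(totals.values())
    if n == 0:
        raise ValueError("no records to summarise")

    return {
        label: {
            "share": totals[label] / n,
            "accuracy": hits[label] / totals[label],
        }
        for label in totals
    }


if __name__ == "__main__":
    # Invented toy records, not results from the study.
    toy = [("high", True), ("high", True), ("high", False), ("low", False)]
    for label, stats in confidence_report(toy).items():
        print(label, stats)
```

A large "high" share only indicates reliable confidence if the accuracy within that bucket is also high. A model that labels most predictions as high confidence while being wrong on many of them is overconfident rather than trustworthy.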

Innovation

Methods, ideas, or system contributions that make the work stand out.

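Evaluates two LLMs on a structured symptom-to-diagnosis dataset at both the disease level and the category level.

DeepSeek-R1 outperforms O3 Mini in disease-level accuracy (76% vs. 72%) and overall accuracy (82% vs. 75%).

Confidence analysis shows DeepSeek-R1 gives high-confidence predictions in 92% of cases, versus 68% for O3 Mini.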
