Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

📅 2025-11-30
🤖 AI Summary
The prevailing assumption that domain-specific clinical AI systems are inherently safer and more reliable than general-purpose foundation models lacks rigorous empirical validation. Method: We conduct a systematic, independent evaluation of state-of-the-art general-purpose large language models (GPT-5, Gemini 3 Pro, Claude Sonnet 4.5) against specialized clinical tools (OpenEvidence, UpToDate Expert AI) on a 1,000-case mini-benchmark (MedQA + HealthBench), using multidimensional metrics: answer completeness, patient-centered communication quality, contextual reasoning fidelity, and safety-aware inference. Results: General-purpose models significantly outperform the specialized systems on all dimensions, with GPT-5 achieving the highest overall performance; the specialized tools exhibit deficiencies in safety-critical reasoning and clinical workflow alignment. This study provides the first high-coverage, third-party assessment of these clinical decision support systems and proposes a transparent, reproducible, patient-workflow-oriented evaluation framework for trustworthy clinical AI deployment.

📝 Abstract
Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
Problem

Research questions and friction points this paper is trying to address.

Evaluating clinical AI tools against generalist LLMs
Identifying performance gaps in medical decision support
Highlighting need for independent AI evaluation in healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated clinical AI tools against generalist LLMs
Used MedQA and HealthBench tasks for benchmark
Found generalist models outperform specialized clinical tools
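The evaluation above can be illustrated with a minimal scoring sketch. The item format, the `ask` callable, and the stub "system" below are illustrative assumptions, not the paper's actual harness; the paper's real metrics also include rubric-based dimensions beyond simple accuracy.

```python
# Hypothetical mini-benchmark harness: score a question-answering "system"
# (a callable that returns an option letter) on MedQA-style items.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    question: str
    options: dict          # e.g. {"A": "...", "B": "..."}
    answer: str            # gold option letter

def accuracy(items: list, ask: Callable) -> float:
    """Fraction of items where the system's chosen option matches gold."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if ask(it) == it.answer)
    return correct / len(items)

# Toy usage: a stub system that always picks option "A".
items = [
    Item("Q1", {"A": "x", "B": "y"}, "A"),
    Item("Q2", {"A": "x", "B": "y"}, "B"),
]
always_a = lambda it: "A"
print(accuracy(items, always_a))  # 0.5
```

In practice each evaluated system (generalist LLM or clinical tool) would be wrapped as its own `ask` callable, so all systems are scored on identical items.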
Krithik Vishwanath
Department of Neurological Surgery, NYU Langone Health, New York, New York, USA
Mrigayu Ghosh
Department of Biomedical Engineering, The University of Texas at Austin, Austin, Texas, USA
Anton Alyakin
Medical student at Washington University
LLMs · Neurosurgery · Networks · Causality
Daniel Alexander Alber
Department of Neurological Surgery, NYU Langone Health, New York, New York, USA
Yindalon Aphinyanaphongs
Global AI Frontier Lab, New York University, New York, New York, USA
Eric Karl Oermann
New York University
Artificial Intelligence · Human Intelligence