🤖 AI Summary
Large language models (LLMs) show limited and uneven diagnostic accuracy for acute conditions in emergency medicine. Method: We propose MEDAS (Medical Emergency Diagnostic Advising System), a super-learner that uses a meta-learner to combine the diagnostic outputs of five major LLMs (Gemini, Llama, Grok, GPT, and Claude), learning the different capabilities of each model so that the final diagnosis draws on the collective capabilities of the whole cluster. Contribution/Results: On real emergency cases, individual LLMs reach between 58% and 65% diagnostic accuracy, which exceeds the reported accuracy of human doctors. Even with a quite basic meta-learner, the MEDAS ensemble reaches 70%, higher than any individual LLM. In 85% of cases at least one of the five LLMs gives the correct diagnosis, which suggests that a stronger meta-learner could raise accuracy further. The results indicate that a meta-learning approach can take advantage of the combined knowledge in the medical datasets used to train the individual LLMs.
📝 Abstract
Medical decision-support and advising systems are critical for helping emergency physicians assess patients' conditions and make diagnoses quickly and accurately. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years, and Large Language Models (LLMs) have been employed in various medical decision-support systems. We studied the responses of a group of different LLMs to real cases in emergency medicine. Our study of five of the most renowned LLMs showed significant differences in their ability to diagnose acute diseases in medical emergencies, with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner, MEDAS (Medical Emergency Diagnostic Advising System), from five major LLMs (Gemini, Llama, Grok, GPT, and Claude). The super-learner achieves a higher diagnostic accuracy, 70%, even with a quite basic meta-learner. Moreover, in 85% of cases at least one of the LLMs integrated in the super-learner produces the correct diagnosis. The super-learner integrates a cluster of LLMs using a meta-learner that learns the different capabilities of each LLM, so that the diagnostic accuracy of the overall model draws on the collective capabilities of all LLMs in the cluster. The results of our study showed that the aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
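The abstract does not specify how the meta-learner is built, so the following is only a minimal sketch of one simple possibility: an accuracy-weighted vote over the diagnoses returned by each LLM. The function names, the model labels (`model_a`, `model_b`, `model_c`), and the toy cases are all hypothetical and are not taken from the paper. The helper `oracle_rate` illustrates one way to compute the share of cases in which at least one model is correct.

```python
from collections import defaultdict


def fit_weights(train_preds, train_labels):
    """Learn one weight per model: its accuracy on labelled training cases.

    train_preds maps a model name to its list of predicted diagnoses,
    aligned with train_labels. (Hypothetical data layout.)
    """
    n = len(train_labels)
    return {
        model: sum(p == y for p, y in zip(preds, train_labels)) / n
        for model, preds in train_preds.items()
    }


def predict(weights, case_preds):
    """Return the diagnosis with the largest total weight of supporting models."""
    scores = defaultdict(float)
    for model, diagnosis in case_preds.items():
        scores[diagnosis] += weights.get(model, 0.0)
    return max(scores, key=scores.get)


def oracle_rate(preds, labels):
    """Fraction of cases in which at least one model gives the true diagnosis."""
    hits = sum(
        any(preds[model][i] == y for model in preds)
        for i, y in enumerate(labels)
    )
    return hits / len(labels)


# Toy, made-up cases purely for illustration (not data from the study).
train_labels = ["MI", "PE", "sepsis", "stroke"]
train_preds = {
    "model_a": ["MI", "PE", "sepsis", "MI"],  # 3 of 4 correct
    "model_b": ["MI", "MI", "MI", "stroke"],  # 2 of 4 correct
    "model_c": ["MI", "PE", "MI", "MI"],      # 2 of 4 correct
}

weights = fit_weights(train_preds, train_labels)

# Two weaker models that agree can outvote the single strongest model.
new_case = {"model_a": "MI", "model_b": "stroke", "model_c": "stroke"}
print(predict(weights, new_case))
print(oracle_rate(train_preds, train_labels))
```

A weighted vote is deliberately basic. A gap between the ensemble's accuracy and the rate at which at least one model is right is the kind of headroom that a more expressive meta-learner, for example one conditioned on case features, might try to close.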