🤖 AI Summary
Existing large language model (LLM) ensembles for clinical decision support are typically static and preconfigured, lacking adaptive component selection mechanisms — which leads to insufficient model diversity, inconsistent outputs, and reliance on manual curation or domain-expert verification. Method: We propose a training-free adaptive collaboration framework that dynamically selects high-diversity, high-consistency submodels via two mechanisms: "self-diversity maximization" (scoring each model's own outputs with fuzzy matching) and "cross-consistency optimization" (cross-evaluating models against the most diverse one and progressively masking the least consistent member), enabling on-the-fly ensemble reconfiguration and collaborative inference. Results: Our approach significantly outperforms GPT-4 on NEJMQA and MMLU-Pro-health; it achieves 65.47% accuracy in Obstetrics and Gynecology and is the first method to meet the official passing threshold in every medical specialty. This work establishes a new paradigm for trustworthy multi-LLM clinical collaboration.
📝 Abstract
The collaborativeness of large language models (LLMs) has proven effective in natural language processing systems and holds considerable promise for healthcare. However, existing collaboration pipelines lack explicit component selection rules, necessitating human intervention or clinical-specific validation. Moreover, existing architectures rely heavily on a predefined LLM cluster in which some LLMs underperform in medical decision support scenarios, undermining the collaborativeness of the cluster. To this end, we propose an adaptive cluster collaborativeness methodology involving self-diversity and cross-consistency maximization mechanisms to boost LLMs' medical decision support capability. For self-diversity, we compute the fuzzy matching value over pairwise outputs within an LLM as its self-diversity value, then prioritize LLMs with high self-diversity values as cluster components in a training-free manner. For cross-consistency, we first measure cross-consistency values between the LLM with the highest self-diversity value and the others, and then gradually mask out the LLM with the lowest cross-consistency value to eliminate potentially inconsistent outputs during collaborative propagation. Extensive experiments on two specialized medical datasets, NEJMQA and MMLU-Pro-health, demonstrate the effectiveness of our method across physician-oriented specialties. For example, on NEJMQA our method reaches the publicly available official passing score across all disciplines, notably achieving an ACC of 65.47% compared to the 56.12% achieved by GPT-4 on the Obstetrics and Gynecology discipline.
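The selection pipeline described above can be sketched in a few lines. This is a minimal illustration under assumptions not stated in the abstract: `difflib.SequenceMatcher` stands in for the paper's fuzzy matching function, and the function names (`self_diversity`, `cross_consistency`, `select_cluster`) and the `keep` parameter are hypothetical, chosen here only to mirror the described steps — rank models by self-diversity, anchor on the most diverse one, then progressively mask the least cross-consistent members.

```python
from difflib import SequenceMatcher
from itertools import combinations

def fuzzy_match(a: str, b: str) -> float:
    """Fuzzy match ratio in [0, 1] between two output strings."""
    return SequenceMatcher(None, a, b).ratio()

def self_diversity(outputs: list[str]) -> float:
    """Mean pairwise dissimilarity among one LLM's sampled outputs.
    Higher values mean the model produces more varied candidates."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - fuzzy_match(a, b) for a, b in pairs) / len(pairs)

def cross_consistency(anchor: list[str], candidate: list[str]) -> float:
    """Mean fuzzy agreement between the anchor model's outputs and a
    candidate model's outputs on the same queries."""
    return sum(fuzzy_match(a, c) for a, c in zip(anchor, candidate)) / len(anchor)

def select_cluster(model_outputs: dict[str, list[str]], keep: int = 3) -> list[str]:
    """Rank models by self-diversity, take the most diverse as anchor,
    then progressively mask out the least cross-consistent member
    until only `keep` models remain."""
    ranked = sorted(model_outputs,
                    key=lambda m: self_diversity(model_outputs[m]),
                    reverse=True)
    anchor, cluster = ranked[0], ranked[:]
    while len(cluster) > keep:
        worst = min((m for m in cluster if m != anchor),
                    key=lambda m: cross_consistency(model_outputs[anchor],
                                                    model_outputs[m]))
        cluster.remove(worst)
    return cluster
```

In this sketch, a model whose samples are near-identical scores low self-diversity and, if its answers also disagree with the anchor's, is the first to be masked out — a training-free filter that needs only the models' raw text outputs.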