🤖 AI Summary
Medical image classifiers suffer from poor interpretability, while large language models (LLMs) exhibit low stability and limited clinical credibility in visual reasoning, leaving a critical misalignment between AI decisions and clinical requirements. Method: We propose MobileCoAtNet, a hybrid vision model, together with a multi-LLM collaborative reasoning framework, pioneering an end-to-end approach that jointly performs image classification and structured clinical narrative generation. We construct a dual-expert-validated benchmark covering five clinically essential dimensions (e.g., etiology, symptoms, treatment) and systematically evaluate 32 LLMs. Contribution/Results: Our analysis reveals, for the first time, the extreme sensitivity of medical LLMs to prompt engineering: minor prompt variations induce substantial output instability. MobileCoAtNet achieves high-accuracy classification across eight gastric disease categories; however, no evaluated LLM attains human-level reasoning stability. We publicly release all code and datasets to advance trustworthy, clinically aligned AI in medicine.
📝 Abstract
Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models (LLMs) can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the kind of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight gastric disease classes. Its outputs then drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining deep learning with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path toward safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.
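The prompt-sensitivity finding above can be quantified with a simple stability score, for example the mean pairwise similarity of a model's answers to paraphrased versions of the same question. The sketch below is illustrative only: the function name `pairwise_stability` and the response strings are hypothetical, and the paper's actual evaluation metric may differ.

```python
import itertools
from difflib import SequenceMatcher

def pairwise_stability(responses):
    """Mean pairwise text similarity across answers to paraphrased prompts.

    A score of 1.0 means every paraphrase yielded an identical answer;
    lower scores indicate the prompt-induced instability described above.
    """
    pairs = list(itertools.combinations(responses, 2))
    if not pairs:  # a single response is trivially stable
        return 1.0
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(sims) / len(sims)

# Hypothetical outputs from one LLM given three paraphrases of the same
# diagnostic question (illustrative strings, not real model output).
stable = ["Gastritis; advise endoscopic follow-up."] * 3
unstable = [
    "Gastritis; advise endoscopic follow-up.",
    "Likely peptic ulcer; start PPI therapy.",
    "No abnormality detected.",
]

print(round(pairwise_stability(stable), 2))    # identical answers -> 1.0
print(round(pairwise_stability(unstable), 2))  # divergent answers -> below 1.0
```

Averaging over all response pairs (rather than comparing to a single reference answer) makes the score symmetric in the paraphrases, so no single prompt wording is privileged as "the" canonical one.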