DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models

📅 2025-12-14
🤖 AI Summary
Medical image classifiers suffer from poor interpretability, while large language models (LLMs) exhibit low stability and clinical credibility in visual reasoning, leading to a critical misalignment between AI decisions and clinical requirements. Method: We propose MobileCoAtNet—a hybrid vision model—and a multi-LLM collaborative reasoning framework, pioneering an end-to-end approach that jointly performs image classification and structured clinical narrative generation. We construct a dual-expert-validated benchmark covering five clinically essential dimensions (e.g., etiology, symptoms, treatment) and systematically evaluate 32 LLMs. Contribution/Results: Our analysis reveals, for the first time, extreme sensitivity of medical LLMs to prompt engineering—minor prompt variations induce substantial output instability. MobileCoAtNet achieves high-accuracy classification across eight gastric disease categories; however, no evaluated LLM attains human-level reasoning stability. We publicly release all code and datasets to advance trustworthy, clinically aligned AI in medicine.

📝 Abstract
Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.
Problem

Research questions and friction points this paper is trying to address.

Bridging medical image classification with clinical reasoning explanations
Evaluating LLMs' stability in generating expert-level medical narratives
Developing a hybrid framework to improve diagnostic transparency and reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines deep learning with large language models for medical reasoning
Uses MobileCoAtNet hybrid model for accurate endoscopic image classification
Creates expert-verified benchmarks to evaluate LLM reasoning quality
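The pipeline described above can be sketched as a classify-then-prompt loop: a vision model's predicted label seeds a structured prompt that an LLM expands into a clinical narrative. This is a minimal hypothetical illustration, not the paper's implementation: the `classify` stub standing in for MobileCoAtNet, the example label, and the prompt wording are all assumptions; only the five benchmark dimensions are taken from the paper.

```python
# Hypothetical sketch of a classify-then-reason pipeline in the spirit of
# DL^3M. The stub classifier and the example label are illustrative
# assumptions; the five dimensions come from the paper's benchmark.

# The five expert-benchmark dimensions named in the paper.
DIMENSIONS = ["etiology", "symptoms", "treatment", "lifestyle", "follow-up care"]

def classify(image) -> str:
    """Stand-in for the MobileCoAtNet classifier; returns a disease label."""
    return "gastric ulcer"  # placeholder prediction (assumed label)

def build_reasoning_prompt(label: str) -> str:
    """Turn a predicted label into a structured prompt for the LLM stage."""
    lines = [
        f"Endoscopic image classified as: {label}.",
        "Write a structured clinical narrative covering:",
    ]
    lines += [f"- {d}" for d in DIMENSIONS]
    return "\n".join(lines)

prompt = build_reasoning_prompt(classify(image=None))
print(prompt)
```

Keeping the prompt template fixed and structured is one plausible way to probe the prompt-sensitivity issue the paper reports: varying only the template while holding the label constant isolates how much the narrative drifts.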
Md. Najib Hasan
Dept. SoC, Wichita State University, USA
Imran Ahmad
Dept. School of Business, Wichita State University, USA
Sourav Basak Shuvo
Dept. of BME, KUET, Bangladesh
Md. Mahadi Hasan Ankon
Dept. of CSE, KUET, Bangladesh
Sunanda Das
Dept. of EECS, University of Arkansas
Nazmul Siddique
Ulster University
Computational Intelligence, Machine Learning, Nature-inspired Computing, Cybernetics, Robotics
Hui Wang
Dept. SoC, Eng. and Intel. Sys., Ulster University, UK