🤖 AI Summary
Medical image classifiers suffer from poor interpretability, while large language models (LLMs) exhibit low stability and limited clinical credibility in visual reasoning, leaving a critical misalignment between AI decisions and clinical requirements. Method: We propose MobileCoAtNet, a hybrid vision model, together with a multi-LLM collaborative reasoning framework, pioneering an end-to-end approach that jointly performs image classification and structured clinical narrative generation. We construct a dual-expert-validated benchmark covering five clinically essential dimensions (e.g., etiology, symptoms, treatment) and systematically evaluate 32 LLMs. Contribution/Results: Our analysis reveals, for the first time, the extreme sensitivity of medical LLMs to prompt engineering: minor prompt variations induce substantial output instability. MobileCoAtNet achieves high-accuracy classification across eight gastric disease categories; however, no evaluated LLM attains human-level reasoning stability. We publicly release all code and datasets to advance trustworthy, clinically aligned AI in medicine.
📝 Abstract
Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models (LLMs) can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the kind of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight gastric disease classes. Its outputs then drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining deep learning with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path toward safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.
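The prompt-sensitivity finding above can be quantified with a simple stability score, for example the mean pairwise similarity of a model's answers to paraphrased versions of the same question. The sketch below is illustrative only: the function name `pairwise_stability` and the response strings are hypothetical, and the paper's actual evaluation metric may differ.

```python
import itertools
from difflib import SequenceMatcher

def pairwise_stability(responses):
    """Mean pairwise text similarity across answers to paraphrased prompts.

    A score of 1.0 means every paraphrase yielded an identical answer;
    lower scores indicate the prompt-induced instability described above.
    """
    pairs = list(itertools.combinations(responses, 2))
    if not pairs:  # a single response is trivially stable
        return 1.0
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(sims) / len(sims)

# Hypothetical outputs from one LLM given three paraphrases of the same
# diagnostic question (illustrative strings, not real model output).
stable = ["Gastritis; advise endoscopic follow-up."] * 3
unstable = [
    "Gastritis; advise endoscopic follow-up.",
    "Likely peptic ulcer; start PPI therapy.",
    "No abnormality detected.",
]

print(round(pairwise_stability(stable), 2))    # identical answers -> 1.0
print(round(pairwise_stability(unstable), 2))  # divergent answers -> below 1.0
```

Averaging over all response pairs (rather than comparing to a single reference answer) makes the score symmetric in the paraphrases, so no single prompt wording is privileged as "the" canonical one.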