🤖 AI Summary
A critical global shortage of radiologists is widening the gap between demand and supply in medical image diagnosis. To address this, we propose a vision–language collaborative framework that integrates the EVA Vision Transformer with the Llama 2 large language model (LLM) for end-to-end disease classification and lesion localization in chest X-ray images. Leveraging task-specific prompt engineering and transfer learning, our method encodes visual features into semantic tokens fed to the LLM, which jointly generates interpretable diagnostic reports and bounding boxes for pathological regions. Evaluated on the VinDr-CXR dataset, the framework achieves an F1 score of 0.76, demonstrating substantial improvements in both diagnostic efficiency and interpretability. This work establishes a novel paradigm for LLM-augmented medical imaging analysis, combining clinical applicability with methodological innovation.
📝 Abstract
The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.
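The pipeline described above (ViT encodes the X-ray into tokens, which are combined with an engineered prompt and passed to the LLM) can be sketched conceptually as follows. This is a minimal illustration only: the function names, random-projection "weights," and dimensions are assumptions for demonstration, not the authors' actual EVA/Llama 2 implementation.

```python
# Conceptual sketch of the image-to-LLM token flow described in the abstract.
# All names and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def vit_encode(image: np.ndarray, patch: int = 16, dim: int = 64) -> np.ndarray:
    """Stand-in for the EVA ViT: split the image into non-overlapping
    patches and map each patch to a feature vector (random projection
    here in place of learned encoder weights)."""
    h, w = image.shape
    patches = [
        image[i:i + patch, j:j + patch].ravel()
        for i in range(0, h, patch)
        for j in range(0, w, patch)
    ]
    proj = rng.standard_normal((patch * patch, dim))
    return np.stack(patches) @ proj                   # (n_patches, dim)

def project_to_llm(visual_feats: np.ndarray, llm_dim: int = 128) -> np.ndarray:
    """Linear adapter mapping visual features into the LLM's token-embedding
    space; this is the kind of component tuned via transfer learning."""
    w = rng.standard_normal((visual_feats.shape[1], llm_dim))
    return visual_feats @ w

def build_llm_input(visual_tokens: np.ndarray,
                    prompt_embeds: np.ndarray) -> np.ndarray:
    """Prepend visual tokens to the embedded task prompt; the LLM would then
    decode a report plus bounding boxes from this joint sequence."""
    return np.concatenate([visual_tokens, prompt_embeds], axis=0)

image = rng.standard_normal((224, 224))        # dummy chest X-ray
prompt = rng.standard_normal((12, 128))        # embedded task-specific prompt
tokens = project_to_llm(vit_encode(image))     # 14x14 = 196 visual tokens
llm_input = build_llm_input(tokens, prompt)    # joint sequence for the LLM
print(llm_input.shape)                         # (208, 128)
```

The key design point this sketch highlights is that the LLM never sees pixels: the adapter makes visual tokens interchangeable with text-token embeddings, so classification and localization reduce to conditional text generation.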