PaliGemma-CXR: A Multi-task Multimodal Model for TB Chest X-ray Interpretation

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the critical shortage of radiologists in primary healthcare settings and the challenges of intelligent interpretation of chest X-rays for tuberculosis (TB) screening, this paper proposes the first unified multi-task, multimodal model specifically designed for TB chest X-ray analysis. To mitigate negative transfer across tasks, severe class imbalance, and scarcity of multimodal training samples, we introduce a novel cross-task collaborative training paradigm built upon the PaliGemma architecture, incorporating inverse data-scale sampling, task-weighted sampling, joint loss optimization, and multimodal alignment mechanisms. The model simultaneously supports four clinical tasks: TB diagnosis, lesion detection/segmentation, radiology report generation, and visual question answering (VQA). Experimental results demonstrate state-of-the-art performance: 90.32% accuracy for TB classification, 98.95% accuracy for closed-ended VQA, BLEU-4 score of 41.3 for report generation, and mAP scores of 19.4 (detection) and 16.0 (segmentation). This framework significantly enhances the automation and accessibility of TB screening in resource-limited settings.

Technology Category

Application Category

📝 Abstract
Tuberculosis (TB) is a infectious global health challenge. Chest X-rays are a standard method for TB screening, yet many countries face a critical shortage of radiologists capable of interpreting these images. Machine learning offers an alternative, as it can automate tasks such as disease diagnosis, and report generation. However, traditional approaches rely on task-specific models, which cannot utilize the interdependence between tasks. Building a multi-task model capable of performing multiple tasks poses additional challenges such as scarcity of multimodal data, dataset imbalance, and negative transfer. To address these challenges, we propose PaliGemma-CXR, a multi-task multimodal model capable of performing TB diagnosis, object detection, segmentation, report generation, and VQA. Starting with a dataset of chest X-ray images annotated with TB diagnosis labels and segmentation masks, we curated a multimodal dataset to support additional tasks. By finetuning PaliGemma on this dataset and sampling data using ratios of the inverse of the size of task datasets, we achieved the following results across all tasks: 90.32% accuracy on TB diagnosis and 98.95% on close-ended VQA, 41.3 BLEU score on report generation, and a mAP of 19.4 and 16.0 on object detection and segmentation, respectively. These results demonstrate that PaliGemma-CXR effectively leverages the interdependence between multiple image interpretation tasks to enhance performance.
Problem

Research questions and friction points this paper is trying to address.

Addresses the shortage of radiologists for TB chest X-ray interpretation.
Develops a multi-task model for TB diagnosis and image analysis tasks.
Overcomes challenges like data scarcity and task interdependence in machine learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task multimodal model for TB diagnosis
Finetuning PaliGemma on curated multimodal dataset
Sampling data using inverse task dataset ratios
🔎 Similar Papers
No similar papers found.