Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Prior work has not systematically evaluated how well multimodal large language models (MLLMs) can localize pathologies on chest X-rays, in part because these models have no explicit coordinate-output mechanism for medical imaging tasks. Method: The authors use a prompting pipeline that overlays a spatial grid on each radiograph and asks GPT-4, GPT-5, and MedGemma to return coordinate-based predictions directly, without post-hoc regression or segmentation. Contribution/Results: Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 reaches 49.7% localization accuracy (center-point error ≤30 mm), ahead of GPT-4 (39.1%) and MedGemma (17.7%) but below both a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Error analysis shows that 86.2% of GPT-5's predictions fall within anatomically plausible regions, even when they are not precisely localized. The results suggest that MLLMs have a rudimentary spatial understanding of anatomy and that prompting alone can elicit zero-shot localization without task-specific fine-tuning, although reliable use still calls for integration with task-specific tools.

📝 Abstract

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, showing limited capacity to generalize to this novel task. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
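The phrase "averaged across nine pathologies" describes a macro average: accuracy is computed per pathology first, then the per-pathology values are averaged. The sketch below illustrates that arithmetic only. The abstract does not define what counts as a correct localization, so the point-inside-box criterion, the `point_in_box` and `macro_accuracy` names, and the record layout are all assumptions made for illustration, not the authors' evaluation code.

```python
from collections import defaultdict

def point_in_box(point, box):
    """Hypothetical hit criterion: the predicted point lies inside the box (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def macro_accuracy(records, is_hit=point_in_box):
    """Per-pathology accuracy and its unweighted mean.

    records: iterable of (pathology, predicted_point_or_None, reference_region).
    A missing prediction (None) is counted as a miss.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pathology, point, region in records:
        totals[pathology] += 1
        if point is not None and is_hit(point, region):
            hits[pathology] += 1
    if not totals:
        raise ValueError("no records to evaluate")
    per_pathology = {p: hits[p] / totals[p] for p in totals}
    mean = sum(per_pathology.values()) / len(per_pathology)
    return per_pathology, mean
```

Because each pathology contributes equally to the mean, a rare finding weighs as much as a common one. The `is_hit` argument is left pluggable so that a different criterion, such as a distance threshold, can be substituted.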

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to localize pathologies on chest radiographs

Assessing spatial understanding of anatomy and disease in medical images

Comparing general-purpose and domain-specific models for pathology localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating MLLMs using a spatial grid overlay

Prompting pipeline for coordinate-based pathology localization

Systematic comparison with domain-specific and general-purpose models
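The grid-overlay and coordinate-prompting ideas listed above can be illustrated with a minimal sketch. This page does not state the grid size, the labeling scheme, or the response format the authors used, so the 10 x 10 grid, the "(x, y)" reply format, and the helper names `grid_lines`, `parse_point`, and `grid_to_pixel` are all hypothetical.

```python
import re

def grid_lines(width, height, n=10):
    """Pixel positions of the vertical and horizontal lines of an n x n overlay."""
    xs = [round(i * width / n) for i in range(n + 1)]
    ys = [round(i * height / n) for i in range(n + 1)]
    return xs, ys

def parse_point(reply, n=10):
    """Return the first '(x, y)' pair in grid units found in a model reply.

    Returns None if no pair is present or if it lies outside the grid.
    """
    match = re.search(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)", reply)
    if match is None:
        return None
    gx, gy = float(match.group(1)), float(match.group(2))
    if not (0 <= gx <= n and 0 <= gy <= n):
        return None
    return gx, gy

def grid_to_pixel(point, width, height, n=10):
    """Convert grid-unit coordinates to pixel coordinates in the original image."""
    gx, gy = point
    return gx * width / n, gy * height / n
```

In such a setup, the lines from `grid_lines` would be drawn on the radiograph before it is sent to the model, and the reply would be parsed and mapped back to pixels for comparison with the reference annotation. Rejecting out-of-range or missing coordinates keeps malformed replies from being silently counted as valid predictions.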

🔎 Similar Papers

Multi-modal vision-language model for generalizable annotation-free pathology localization and clinical diagnosis