CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current medical multimodal large language models (MLLMs) perform well on simplified evaluation tasks but struggle with the complexity of real-world clinical interpretation of cardiac magnetic resonance (CMR) imaging. To address this gap, this work proposes CardioLens, the first multi-sequence CMR evaluation framework designed to reflect authentic clinical workflows. Built on private hospital data, CardioLens features a leakage-resistant test set generated through a report-to-QA pipeline covering 4D Cine, late gadolinium enhancement (LGE), perfusion, and T2-weighted sequences, with human validation and several slice selection strategies. Evaluation across 24 state-of-the-art MLLMs reveals consistently poor performance that degrades along the interpretation pipeline (image understanding, report generation, disease diagnosis). Changing the slice selection strategy shifts performance only marginally (about 1%), and explicit reasoning prompts do not help. The results expose critical failure modes such as "category collapse," in which models default to frequent abnormal categories, underscoring a substantial gap between current models and clinically reliable deployment.
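To make the idea of slice selection under a budget concrete, here is a minimal sketch of two simple protocols: uniform random sampling and evenly spaced sampling. It is not the paper's implementation. The function names, the seed parameter, and the use of even spacing as a stand-in for a structured protocol are all assumptions made for illustration; the source does not specify how its clinically motivated or data-driven protocols choose slices.

```python
import random


def select_random(num_slices, budget, seed=0):
    """Pick up to `budget` slice indices uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    k = min(budget, num_slices)
    return sorted(rng.sample(range(num_slices), k))


def select_evenly_spaced(num_slices, budget):
    """Pick up to `budget` indices spread evenly across the stack (hypothetical helper)."""
    k = min(budget, num_slices)
    if k <= 0:
        return []
    if k == 1:
        return [num_slices // 2]  # a single slice: take the middle one
    step = (num_slices - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


if __name__ == "__main__":
    # Illustrative numbers only: a 30-slice stack and a budget of 8 slices.
    print(select_random(30, 8))
    print(select_evenly_spaced(30, 8))
```

Both helpers return sorted indices into a single stack, so the same downstream prompt construction can be reused while only the selection rule and the budget vary.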

📝 Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.
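The category-collapse pattern described above can be checked with a few lines of code. The sketch below is an illustration under assumptions, not the paper's analysis: the function name and the toy labels are hypothetical. It reports which label a model predicts most often, what share of all predictions that label receives, and the recall for each true class. A high share combined with near-zero recall on the other classes is the signature of collapse.

```python
from collections import Counter


def collapse_summary(y_true, y_pred):
    """Return (most_predicted_label, share_of_predictions, per_class_recall)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Which label does the model fall back on, and how often?
    top_label, top_count = Counter(y_pred).most_common(1)[0]
    share = top_count / len(y_pred)

    # Recall per true class: correct predictions divided by class size.
    true_counts = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recall = {label: hits[label] / n for label, n in true_counts.items()}
    return top_label, share, recall


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = ["A", "A", "B", "C"]
    y_pred = ["A", "A", "A", "A"]
    print(collapse_summary(y_true, y_pred))
```

In the toy case every prediction is "A", so the share is 1.0 and recall for "B" and "C" is 0.0, even though overall accuracy is 50%.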

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Cardiovascular Magnetic Resonance

Clinical Reality Gap

Medical Image Interpretation

Evaluation Benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

CardioLens

Multimodal Large Language Models

Cardiovascular MRI