🤖 AI Summary
Problem: Clinical reports accompanying chest X-ray (CXR) images contain frequent abbreviations, varied writing styles, and substantial noise, all of which impair generalization in CXR image–text alignment.
Method: We propose LLM2VEC4CXR, a domain-adapted large language model (LLM) text encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. The approach prioritizes robust textual representation over data scale and aligns clinical text and image embeddings via contrastive learning.
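Contribution/Results: The models are trained on 1.6 million CXR studies from public and private sources, with heterogeneous and noisy reports. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines and achieves strong report-level alignment, while LLM2CLIP4CXR improves cross-modal retrieval accuracy and clinically oriented scores. LLM2CLIP4CXR also generalizes across datasets better than prior medical CLIP variants, supporting the claim that robust LLM-driven text representations, rather than scale alone, are key to reliable medical cross-modal alignment.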
📝 Abstract
Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports, including abbreviations, impression-only notes, and stylistic variability. Unlike general-domain settings where more data often leads to better performance, naively scaling to large collections of noisy reports can plateau or even degrade model learning. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles and better guide image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR leverages these embeddings to boost retrieval accuracy and clinically oriented scores, with stronger cross-dataset generalization than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous and noisy reports, our models demonstrate that robustness, not scale alone, is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.
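To make the dual-tower alignment idea concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive objective and a recall@1 retrieval check. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, so the symmetric InfoNCE form, the temperature of 0.07, and all function names here are hypothetical. The embeddings are stand-ins for the outputs of a vision backbone and a text encoder such as LLM2VEC4CXR.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image (rows) and every report (columns)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses.

    Assumes pair i in image_embs matches pair i in text_embs; the
    temperature value is an assumption, not taken from the paper.
    """
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def recall_at_1(image_embs, text_embs):
    """Fraction of images whose most similar report is the paired one."""
    sims = similarity_matrix(image_embs, text_embs)
    hits = sum(
        1
        for i, row in enumerate(sims)
        if max(range(len(row)), key=row.__getitem__) == i
    )
    return hits / len(sims)


if __name__ == "__main__":
    # Toy embeddings: two images and their two paired reports.
    images = [[1.0, 0.0], [0.0, 1.0]]
    reports = [[0.9, 0.1], [0.2, 0.8]]
    print("loss:", symmetric_contrastive_loss(images, reports))
    print("recall@1:", recall_at_1(images, reports))
```

In this formulation the loss is low when each image is most similar to its own report and high when pairs are mismatched, which is why the quality of the text embeddings directly affects what the image tower can learn.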