🤖 AI Summary
To address the semantic misalignment between visual and textual embeddings, the difficulty of cross-modal alignment, and the scarcity of annotations for underrepresented abnormalities in zero-shot 3D medical image diagnosis, this paper proposes the Bridged Semantic Alignment (BrgSA) framework. The method uses large language model (LLM)-generated summaries of clinical reports as high-level semantic anchors and introduces a Cross-Modal Knowledge Interaction (CMKI) module that employs a cross-modal knowledge bank as a semantic bridge between the two modalities, enabling zero-shot transfer without additional annotations. Evaluated on two public benchmark datasets and a custom benchmark covering 15 underrepresented abnormalities, the approach achieves state-of-the-art performance, with substantial gains in zero-shot diagnostic accuracy, particularly for abnormalities with extremely limited annotations, demonstrating strong generalization capability. The core contribution is a structured cross-modal alignment between LLM-derived semantic summaries and 3D medical image embeddings that narrows the modality gap.
📝 Abstract
3D medical images such as computed tomography (CT) scans are widely used in clinical practice and offer great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, and are limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically find that the visual and textual embeddings produced by existing VLA methods form two well-separated clusters, leaving a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we use a large language model to summarize reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction (CMKI) module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities, and we additionally use two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both the public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.
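The abstract does not spell out how the cross-modal knowledge bank mediates between modalities. As an illustration only, a minimal sketch of the general idea, re-expressing each modality's embedding as a soft attention-weighted mixture of shared bank entries so both land in a common semantic space, might look like the following (the function names, the softmax-attention formulation, and all dimensions are assumptions, not the paper's implementation):

```python
import numpy as np

def bridge_through_bank(embed, bank, temperature=0.07):
    """Re-express an embedding as a convex combination of shared
    knowledge-bank entries, using softmax attention over cosine
    similarities. Both modalities pass through the same bank, so
    their bridged embeddings lie in the span of the bank entries."""
    e = embed / np.linalg.norm(embed)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ e                      # cosine similarity to each entry
    weights = np.exp(sims / temperature)
    weights /= weights.sum()          # softmax attention weights
    return weights @ bank             # mixture of bank entries

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
bank = rng.normal(size=(16, 8))       # 16 hypothetical semantic anchors
img = rng.normal(size=8)              # a visual embedding (placeholder)
txt = rng.normal(size=8)              # a textual embedding (placeholder)

img_b = bridge_through_bank(img, bank)
txt_b = bridge_through_bank(txt, bank)
gap_after = cosine(img_b, txt_b)      # similarity in the bridged space
```

In a trained system the bank entries would be learned jointly with a contrastive alignment objective rather than sampled at random; the sketch only shows the bridging mechanism itself.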