🤖 AI Summary
To address the semantic misalignment between visual and textual embeddings, the difficulty of cross-modal alignment, and the scarcity of annotations for underrepresented abnormalities in zero-shot 3D medical image diagnosis, this paper proposes the Bridged Semantic Alignment (BrgSA) framework. The method uses large language model (LLM)-generated summaries of clinical reports as high-level semantic anchors and introduces a Cross-Modal Knowledge Interaction (CMKI) module that employs a cross-modal knowledge bank as a semantic bridge between the two modalities, enabling zero-shot transfer without additional annotations. Evaluated on two public benchmark datasets and a custom benchmark covering 15 underrepresented abnormalities, the approach achieves state-of-the-art performance, with substantial gains in zero-shot diagnostic accuracy, particularly for abnormalities with extremely limited annotations, demonstrating strong generalization capability. The core contribution is a structured cross-modal alignment between LLM-derived semantic summaries and 3D medical image embeddings that narrows the modality gap.
📝 Abstract
3D medical images such as computed tomography (CT) scans are widely used in clinical practice and offer great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, and are limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically find that the visual and textual embeddings produced by existing VLA methods form two well-separated clusters, leaving a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we use a large language model to summarize reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction (CMKI) module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities, and we additionally use two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both the public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.
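The abstract does not spell out how the cross-modal knowledge bank mediates between modalities. As an illustration only, a minimal sketch of the general idea, re-expressing each modality's embedding as a soft attention-weighted mixture of shared bank entries so both land in a common semantic space, might look like the following (the function names, the softmax-attention formulation, and all dimensions are assumptions, not the paper's implementation):

```python
import numpy as np

def bridge_through_bank(embed, bank, temperature=0.07):
    """Re-express an embedding as a convex combination of shared
    knowledge-bank entries, using softmax attention over cosine
    similarities. Both modalities pass through the same bank, so
    their bridged embeddings lie in the span of the bank entries."""
    e = embed / np.linalg.norm(embed)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ e                      # cosine similarity to each entry
    weights = np.exp(sims / temperature)
    weights /= weights.sum()          # softmax attention weights
    return weights @ bank             # mixture of bank entries

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
bank = rng.normal(size=(16, 8))       # 16 hypothetical semantic anchors
img = rng.normal(size=8)              # a visual embedding (placeholder)
txt = rng.normal(size=8)              # a textual embedding (placeholder)

img_b = bridge_through_bank(img, bank)
txt_b = bridge_through_bank(txt, bank)
gap_after = cosine(img_b, txt_b)      # similarity in the bridged space
```

In a trained system the bank entries would be learned jointly with a contrastive alignment objective rather than sampled at random; the sketch only shows the bridging mechanism itself.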