ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of learning transferable and interpretable representations from 3D medical images, which is hindered by complex anatomical structures and the weak, heterogeneous supervision that radiology reports provide. To this end, the authors propose ASAP, an anatomy-aware, semantically-adaptive pre-training framework that injects organ-level structural priors, dynamically aligns sentence-level report findings with localized volumetric regions, and fuses the two modalities under dual-modal masked modeling, enabling fine-grained vision-language representation learning. The study also establishes a comprehensive pre-training benchmark for chest CT, encompassing 15 datasets and 22 downstream tasks. The proposed method achieves state-of-the-art performance across diverse tasks, including abnormality classification, segmentation, prognosis prediction, and report generation, with particularly pronounced gains under scarce annotations and distribution shift.

📝 Abstract

Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval, and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.
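
To make the first two components more concrete, here is a minimal NumPy sketch of how organ-level priors and sentence-to-region alignment could be combined. This is not the ASAP implementation: the abstract does not specify the architecture, so the function names (`organ_pooled_features`, `selective_alignment`), the masked average pooling, the cosine-similarity softmax, and the `temperature` parameter are all illustrative assumptions.

```python
import numpy as np

def organ_pooled_features(patch_feats, organ_labels, num_organs):
    """Average patch features inside each organ mask (assumed pooling scheme).

    patch_feats:  (P, D) features for P volumetric patches.
    organ_labels: (P,) integer organ id per patch, e.g. derived from a
                  segmentation tool (hypothetical preprocessing step).
    Returns (num_organs, D); organs with no patches get a zero vector.
    """
    patch_feats = np.asarray(patch_feats, dtype=float)
    one_hot = np.eye(num_organs)[np.asarray(organ_labels)]  # (P, K)
    counts = one_hot.sum(axis=0)                            # (K,)
    sums = one_hot.T @ patch_feats                          # (K, D)
    return sums / np.maximum(counts, 1.0)[:, None]

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def selective_alignment(sentence_embs, region_feats, temperature=0.1):
    """Softly assign each report sentence to organ regions (assumed scheme).

    sentence_embs: (S, D) sentence embeddings.
    region_feats:  (K, D) organ-level visual features.
    Returns (weights, attended): weights is (S, K) with rows summing to 1,
    attended is (S, D), the weighted mix of region features per sentence.
    """
    sentence_embs = np.asarray(sentence_embs, dtype=float)
    sims = l2_normalize(sentence_embs) @ l2_normalize(region_feats).T
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights, weights @ region_feats

# Toy usage: three patches, two organs, one sentence embedding.
patches = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
regions = organ_pooled_features(patches, [0, 0, 1], num_organs=2)
weights, attended = selective_alignment(np.array([[0.0, 2.0]]), regions)
```

In this toy setup the sentence embedding points along the second feature axis, so most of its alignment weight falls on the second organ region. A trained model would learn both embeddings jointly rather than rely on fixed vectors as here.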

Problem

Research questions and friction points this paper is trying to address.

medical volumetric representation learning

anatomy-aware

vision-language pre-training

weak supervision

transferable representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

anatomy-aware

semantically-adaptive

volumetric representation learning