ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

📅 2026-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning transferable and interpretable representations from 3D medical images, which is hindered by complex anatomical structures and by the weak, heterogeneous supervision that radiology reports provide. The authors propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a vision-language framework that injects organ-level structural priors, dynamically aligns sentence-level report findings with localized volumetric regions, and fuses the two modalities under dual-modal masked modeling. The study also establishes a comprehensive pre-training benchmark for chest CT covering 15 datasets and 22 downstream tasks. ASAP achieves state-of-the-art performance across tasks including abnormality classification, segmentation, prognosis prediction, and report generation, with particularly pronounced gains when annotations are scarce or the data distribution shifts.
📝 Abstract
Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and the weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval, and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.
Problem

Research questions and friction points this paper is trying to address.

medical volumetric representation learning
anatomy-aware
vision-language pre-training
weak supervision
transferable representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

anatomy-aware
semantically-adaptive
volumetric representation learning
vision-language pre-training
medical imaging
Rongsheng Wang
The Chinese University of Hong Kong, Shenzhen
Deep Learning
Fenghe Tang
University of Science and Technology of China
Medical Image Analysis, Foundation model
Zihang Jiang
School of Biomedical Engineering, USTC, Suzhou Institute for Advanced Research
Computer Vision, Medical Imaging, 3D
Yingtai Li
University of Science & Technology of China
Xu Zhang
University of Science and Technology of China
Clinical NLP, Medical Imaging
Haoran Lai
University of Science and Technology of China
Medical Image Processing, Deep Learning
Wenxin Ma
University of Science and Technology of China
AI, computer vision
Wei Wei
Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC, Hefei, Anhui, 230001, China
Zhiyang He
Massachusetts Institute of Technology
Quantum Information
Xiaodong Tao
Anhui IFLYTEK CO., Ltd.
Rui Yan
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui, China 230026; Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE) Lab, YRD-RIGHT, USTC Suzhou Institute for Advanced Research, Suzhou, Jiangsu, China 215123; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology, Suzhou, Jiangsu, China 215123; Biomedical Basic Research Center (BBRC) of Jiangsu Province, Suzhou, Jiangsu, China 215123
Qingsong Yao
Stanford University | ICT, CAS
Medical Image Computing, Medical Image Analysis
Shaohua Kevin Zhou
Professor, USTC, FAIMBE, FIAMBE, FIEEE, FMICCAI, FNAI
Medical Image Computing, Computer Vision & Pattern Recognition, Machine & Deep Learning