Autoregressive Sequence Modeling for 3D Medical Image Representation

📅 2024-09-13
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the challenges of organ heterogeneity, modality diversity, and the difficulty of modeling local–global relationships in 3D medical imaging (CT/MRI), this work proposes the first voxel-level visual token sequence modeling paradigm for 3D medical images. It represents local anatomical regions as token sequences ordered by spatial proximity, intensity contrast, and semantic consistency, and employs autoregressive prediction to capture long-range contextual dependencies. A random-start pretraining strategy is introduced to mitigate overfitting in token relational learning, improving representation robustness and generalization. Evaluated on nine public downstream tasks spanning segmentation, detection, and classification, the method consistently outperforms state-of-the-art approaches and adapts well across diverse organ anatomies, heterogeneous imaging modalities (e.g., CT vs. MRI), and multi-task clinical diagnostic scenarios. This work establishes a new foundational paradigm for self-supervised representation learning in 3D medical imaging.

📝 Abstract
Three-dimensional (3D) medical images, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), are essential for clinical applications. However, the need for diverse and comprehensive representations is particularly pronounced when considering the variability across different organs, diagnostic tasks, and imaging modalities. How to effectively interpret the intricate contextual information and extract meaningful insights from these images remains an open challenge to the community. While current self-supervised learning methods have shown potential, they often consider an image as a whole, thereby overlooking the extensive, complex relationships among local regions within one or multiple images. In this work, we introduce a pioneering method for learning 3D medical image representations through an autoregressive pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence. By employing an autoregressive sequence modeling task, we predict the next visual token in the sequence, which allows our model to deeply understand and integrate the contextual information inherent in 3D medical images. Additionally, we implement a random startup strategy to avoid overestimating token relationships and to enhance the robustness of learning. The effectiveness of our approach is demonstrated by superior performance over other methods on nine downstream tasks across public datasets.
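The core idea in the abstract, turning a 3D volume into an ordered sequence of visual tokens and training on next-token prediction, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the raster (z, y, x) patch ordering below is a hypothetical stand-in for the paper's spatial, contrast, and semantic sequencing, and the patch size is arbitrary.

```python
import numpy as np

def volume_to_tokens(volume, patch=4):
    """Split a 3D volume into non-overlapping patch tokens.

    Tokens are emitted in raster (z, y, x) order; the paper orders
    tokens by spatial/contrast/semantic correlation instead
    (that sequencing is not reproduced here).
    """
    d, h, w = volume.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    tokens = (
        volume.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
              .transpose(0, 2, 4, 1, 3, 5)   # group patch axes last
              .reshape(-1, patch ** 3)       # one flattened vector per patch
    )
    return tokens  # shape: (num_tokens, patch**3)

def next_token_pairs(tokens):
    """Autoregressive targets: token t serves as input, token t+1 as target."""
    return tokens[:-1], tokens[1:]

vol = np.random.rand(8, 8, 8).astype(np.float32)
toks = volume_to_tokens(vol, patch=4)     # (8, 64): eight 4x4x4 patches
inputs, targets = next_token_pairs(toks)  # (7, 64) each
```

In the actual framework these token vectors would be embedded and fed to a sequence model trained to predict the next token; the helper above only shows how the (input, target) pairs line up.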
Problem

Research questions and friction points this paper is trying to address.

Effectively interpret complex 3D medical image contexts
Extract meaningful insights from diverse medical imaging modalities
Model intricate local relationships in 3D medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive pre-training for 3D medical images
Sequencing images via spatial and semantic correlations
Random startup strategy to prevent overfitting
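The random startup idea in the list above can be sketched as follows. The paper's exact procedure is not specified here; this assumes one plausible reading, namely that each training sequence begins at a randomly chosen token so the model does not always condition on the same leading context, which is what could otherwise overfit fixed token relationships.

```python
import numpy as np

def random_start_sequence(tokens, rng, min_len=2):
    """Return a suffix of the token sequence beginning at a random index.

    Hedged sketch of a 'random startup' strategy (assumed behavior):
    varying the start point prevents the model from always learning
    the relationship between the same fixed leading tokens.
    """
    start = rng.integers(0, len(tokens) - min_len + 1)
    return tokens[start:]
```

A sequence model would then be trained on these randomly truncated sequences instead of always starting from token 0.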
Siwen Wang
Deepwise AI Lab
Chu-ran Wang
School of Computer Science, Peking University
Fei Gao
School of Computer Science, Peking University
Lixian Su
Deepwise AI Lab
Fandong Zhang
Deepwise AI Lab
Yizhou Wang
CFCS, School of Computer Science, Peking University; Institute for Artificial Intelligence, Peking University
Yizhou Yu
The University of Hong Kong, IEEE Fellow
Machine Learning · AI Generated Content · Computer Vision · AI for Medicine