🤖 AI Summary
To address the challenges of organ heterogeneity, modality diversity, and difficulty in modeling local–global relationships in 3D medical imaging (CT/MRI), this work proposes the first voxel-level visual token sequence modeling paradigm for 3D medical images. It represents local anatomical regions as token sequences constrained by spatial proximity, intensity contrast, and semantic consistency, and employs autoregressive prediction to capture long-range contextual dependencies. A random startup pretraining strategy is introduced to avoid overestimating token relationships, thereby improving representation robustness and generalization. Evaluated on nine public downstream tasks—including segmentation, detection, and classification—the method consistently outperforms state-of-the-art approaches. It adapts well across diverse organ anatomies, heterogeneous imaging modalities (e.g., CT vs. MRI), and multi-task clinical diagnostic scenarios. This work establishes a new foundational paradigm for self-supervised representation learning in 3D medical imaging.
📝 Abstract
Three-dimensional (3D) medical images, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), are essential for clinical applications. However, the need for diverse and comprehensive representations is particularly pronounced given the variability across organs, diagnostic tasks, and imaging modalities. How to effectively interpret the intricate contextual information in these images and extract meaningful insights from them remains an open challenge for the community. While current self-supervised learning methods have shown potential, they often treat an image as a whole, thereby overlooking the extensive, complex relationships among local regions within one or multiple images. In this work, we introduce a pioneering method for learning 3D medical image representations through an autoregressive pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence. By employing an autoregressive sequence modeling task, we predict the next visual token in the sequence, which allows our model to deeply understand and integrate the contextual information inherent in 3D medical images. Additionally, we implement a random startup strategy to avoid overestimating token relationships and to enhance the robustness of learning. The effectiveness of our approach is demonstrated by its superior performance over existing methods on nine downstream tasks across public datasets.
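To make the token-sequencing and next-token setup concrete, here is a minimal NumPy sketch. It splits a synthetic 3D volume into non-overlapping patch tokens, orders them in simple raster order (a stand-in for the paper's spatial/contrast/semantic sequencing, which is not specified here), and interprets the "random startup" as rolling the sequence to a random starting token before forming autoregressive (input, target) pairs. All function names and the exact rolling procedure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def volume_to_tokens(vol, patch=2):
    """Split a cubic 3D volume into non-overlapping patch tokens.

    Each patch is flattened into a 1-D vector; tokens are ordered in
    raster (z, y, x) order -- a simple stand-in for the paper's
    spatial/contrast/semantic sequencing.
    """
    d, h, w = vol.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    return (
        vol.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
           .transpose(0, 2, 4, 1, 3, 5)   # group the three patch axes last
           .reshape(-1, patch ** 3)       # (num_tokens, token_dim)
    )

def random_start_pairs(tokens, rng):
    """Form next-token (input, target) pairs from a random start.

    The 'random startup' is interpreted here as rotating the token
    sequence to a random starting position -- an illustrative
    assumption about the strategy described in the abstract.
    """
    start = rng.integers(len(tokens))
    seq = np.roll(tokens, -start, axis=0)
    return seq[:-1], seq[1:]  # predict token t+1 from tokens up to t

rng = np.random.default_rng(0)
vol = rng.standard_normal((8, 8, 8))     # toy volume standing in for CT/MRI
tokens = volume_to_tokens(vol)           # 64 tokens, each of dimension 8
inputs, targets = random_start_pairs(tokens, rng)
```

In a real pre-training run, `inputs` would feed a causal transformer whose output at each position is trained to predict the corresponding row of `targets`; this sketch only shows the data-side construction.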