🤖 AI Summary
To address SAM’s inability to produce semantic labels, its reliance on manual prompts, and its limited performance on 3D medical images, this work proposes MaskSAM, a prompt-free SAM adaptation framework for fully automatic 3D medical image segmentation. Methodologically, it introduces: (1) a mask-classification-driven automatic prompt generation mechanism that eliminates manual prompt initialization; (2) a fusion strategy that sums global and auxiliary classifier tokens to enable semantic label prediction; and (3) lightweight 3D depth-convolution and depth-MLP adapters for efficient fine-tuning of the SAM backbone. The framework comprises a prompt generator, dual 3D adapters, and a mask-classification decoder. On AMOS2022 it achieves a 90.52% Dice score, surpassing nnUNet by 2.7%, with further gains of 1.7% on ACDC and 1.0% on Synapse. This advances SAM toward clinically deployable, fully automatic semantic segmentation.
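The lightweight adapters mentioned above follow the standard bottleneck pattern for parameter-efficient fine-tuning: project down, apply a nonlinearity and depth-wise mixing, project back up, and add a residual connection. A minimal numpy sketch of such a depth-MLP adapter, assuming illustrative shapes and names that are not taken from the paper:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class DepthMLPAdapter:
    """Hypothetical bottleneck adapter: down-project, mix information
    along the depth (slice) axis of a 3D volume, up-project, and add a
    residual. Shapes and names are illustrative, not from the paper."""
    def __init__(self, dim, depth, hidden=None, seed=0):
        rng = np.random.default_rng(seed)
        h = hidden or dim // 4
        self.w_down = rng.normal(0.0, 0.02, (dim, h))
        self.w_depth = rng.normal(0.0, 0.02, (depth, depth))  # mixes slices
        self.w_up = np.zeros((h, dim))  # zero-init: adapter starts as identity

    def __call__(self, x):
        # x: (depth, tokens, dim) embeddings of a 3D volume
        z = gelu(x @ self.w_down)                       # (depth, tokens, h)
        z = np.einsum("dth,de->eth", z, self.w_depth)   # depth-wise mixing
        return x + z @ self.w_up                        # residual connection
```

Zero-initializing the up-projection is a common choice for adapters: the adapted model starts out exactly equal to the frozen backbone, and the adapter's contribution is learned from there.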
📝 Abstract
Segment Anything Model (SAM), a prompt-driven foundation model for natural image segmentation, has demonstrated impressive zero-shot performance. However, SAM performs poorly when applied directly to medical image segmentation: it cannot predict semantic labels, it requires additional prompts, and its accuracy is suboptimal. To address these issues, we propose MaskSAM, a novel mask-classification, prompt-free SAM adaptation framework for medical image segmentation. We design a prompt generator, combined with SAM's image encoder, that produces a set of auxiliary classifier tokens, auxiliary binary masks, and auxiliary bounding boxes. Each pair of auxiliary mask and box prompts removes the need for manually provided prompts. Semantic label prediction is achieved by summing the auxiliary classifier tokens with the learnable global classifier tokens in SAM's mask decoder. Meanwhile, we design a 3D depth-convolution adapter for image embeddings and a 3D depth-MLP adapter for prompt embeddings to fine-tune SAM efficiently. Our method achieves state-of-the-art performance on AMOS2022 with a 90.52% Dice score, a 2.7% improvement over nnUNet, and surpasses nnUNet by 1.7% on ACDC and 1.0% on Synapse.
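The abstract's semantic-label mechanism, summing each auxiliary classifier token with a learnable global classifier token before classification, can be sketched in a few lines. The final linear head and all names here are assumptions for illustration; only the element-wise summation is stated in the abstract:

```python
import numpy as np

def semantic_labels(aux_tokens, global_tokens, w_cls):
    """Fuse per-mask auxiliary classifier tokens with learnable global
    classifier tokens by summation, then project to class logits.

    aux_tokens, global_tokens: (num_masks, dim)
    w_cls: (dim, num_classes) -- hypothetical linear classification head
    Returns the predicted semantic label index for each mask.
    """
    fused = aux_tokens + global_tokens   # element-wise sum, as in the abstract
    logits = fused @ w_cls               # (num_masks, num_classes)
    return logits.argmax(axis=-1)        # one semantic label per mask
```

This is what lets the framework output labeled masks rather than SAM's usual class-agnostic ones: each predicted binary mask carries a token that, after fusion, is classified into an organ/structure category.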