🤖 AI Summary
This study investigates whether general-purpose vision models can serve as viable alternatives to specialized architectures for 2D medical image segmentation. Under a unified training and evaluation protocol, the authors compare eleven specialized medical segmentation architectures and general-purpose vision models across three heterogeneous medical imaging datasets, complemented by qualitative Grad-CAM analyses to assess explainability. The results show that, for the datasets analyzed, general-purpose models outperform the majority of specialized counterparts while capturing clinically relevant structures. These findings challenge the assumption that domain-specific architectural design is indispensable for medical image segmentation and suggest that general vision models are a viable option for heterogeneous medical data spanning different imaging modalities.
📝 Abstract
Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs outperform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems. All code and resources are available on GitHub.