🤖 AI Summary
To address the trade-off between insufficient global modeling capability and excessive computational overhead in real-time binary segmentation of medical images, this paper proposes a lightweight end-to-end architecture. It employs a shallow Swin Transformer–like encoder for efficient global contextual modeling and integrates a U-Net–like decoder with skip connections to preserve fine-grained spatial details. Notably, this work is the first to incorporate Barlow Twins self-supervised pretraining into such hybrid architectures, substantially enhancing feature representation learning from unlabeled data. Evaluated on multiple benchmark medical image datasets, the method achieves competitive segmentation accuracy while reducing model parameters by approximately 40% relative to state-of-the-art Transformer-based approaches and accelerating inference by 2.1×. These advantages make it particularly suitable for resource-constrained, real-time clinical applications.
📝 Abstract
Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as the Swin Transformer or U-Net, our architecture is significantly shallower and more computationally efficient. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that encourages informative representations by reducing redundancy across the learned feature dimensions. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with a substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. Our code is available at: https://github.com/mkianih/Barlow-Swin.
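To make the redundancy-reduction idea behind the Barlow Twins pretraining concrete, here is a minimal NumPy sketch of the Barlow Twins objective: the embeddings of two augmented views are batch-normalized, their cross-correlation matrix is computed, and the loss pushes the diagonal toward 1 (invariance) while shrinking off-diagonal entries (decorrelation). The function name and the weighting value `lambd` are illustrative, not the paper's exact implementation.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """Sketch of the Barlow Twins objective.

    z1, z2: (N, D) embeddings of two augmented views of the same batch.
    lambd:  weight on the off-diagonal (redundancy-reduction) term.
    """
    # Normalize each feature dimension across the batch (zero mean, unit std).
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + 1e-9)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + 1e-9)
    n = z1.shape[0]

    # Cross-correlation matrix between the two views, shape (D, D).
    c = z1.T @ z2 / n

    # Invariance term: diagonal entries should be 1 (views agree per feature).
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    # Redundancy term: off-diagonal entries should be 0 (features decorrelate).
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lambd * off_diag
```

In the full method, `z1` and `z2` would be projector outputs of the shallow Swin-like encoder applied to two random augmentations of the same image; identical views yield a near-zero invariance term, while unrelated views do not.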