Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation?

πŸ“… 2024-06-24
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 3
✨ Influential: 0
πŸ€– AI Summary
To address the high computational cost of Vision Transformer (ViT)-based models and the limited global modeling capability of CNN-based approaches in 3D medical image segmentation, this paper proposes U-VixLSTM, a novel architecture that integrates lightweight Vision-xLSTM modules into the UNet encoder-decoder framework. Local features are extracted via CNNs, while the xLSTM blocks capture cross-block spatiotemporal and long-range dependencies. The authors further introduce patch-wise temporal unfolding and gated state-update mechanisms to improve representational efficiency. Experimental results on the Synapse, ISIC, and ACDC datasets show that U-VixLSTM outperforms state-of-the-art methods, achieving 1.2–2.8% higher Dice scores, 23% faster inference, and 37% lower GPU memory consumption, along with a substantially smaller parameter count and memory footprint. This work points toward efficient, deployable 3D medical image segmentation.

πŸ“ Abstract
The development of efficient segmentation strategies for medical images has evolved from an initial dependence on Convolutional Neural Networks (CNNs) to the current investigation of hybrid models that combine CNNs with Vision Transformers. There is an increasing focus on creating architectures that are both high-performance and computationally efficient, and that can be deployed on remote systems with limited resources. Although transformers can capture global dependencies in the input space, they face challenges from the corresponding high computational and storage costs. This paper investigates the integration of CNNs with Vision Extended Long Short-Term Memory (Vision-xLSTM) units by introducing the novel U-VixLSTM. The Vision-xLSTM blocks capture temporal and global relationships within the patches extracted from the CNN feature maps. A convolutional feature-reconstruction path upsamples the output volume from the Vision-xLSTM blocks to produce the segmentation output. The primary objective is to establish Vision-xLSTM as an appropriate backbone for medical image segmentation, offering excellent performance at reduced computational cost. U-VixLSTM exhibits superior performance compared to state-of-the-art networks on the publicly available Synapse, ISIC and ACDC datasets. Code provided: https://github.com/duttapallabi2907/U-VixLSTM
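The data flow the abstract describes (CNN feature maps → patch sequence → gated recurrent mixing → globally-informed features) can be sketched schematically. This is a minimal NumPy toy, not the authors' implementation: `patchify` and `gated_recurrent_pass` are hypothetical helper names, and the simple input/forget-gate recurrence merely stands in for the far richer xLSTM block.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def patchify(feature_map, patch):
    # Split an (H, W, C) CNN feature map into a sequence of flattened patches,
    # the tokens over which the recurrent block operates.
    H, W, C = feature_map.shape
    seq = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            seq.append(feature_map[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(seq)  # (num_patches, patch * patch * C)

def gated_recurrent_pass(seq, rng):
    # Toy gated recurrence over the patch sequence, standing in for an
    # xLSTM block: each patch updates a running state via input/forget
    # gates, so every output token carries information from earlier patches.
    d = seq.shape[1]
    Wi = rng.standard_normal((d, d)) * 0.01
    Wf = rng.standard_normal((d, d)) * 0.01
    state = np.zeros(d)
    out = []
    for x in seq:
        i_gate = sigmoid(x @ Wi)
        f_gate = sigmoid(x @ Wf)
        state = f_gate * state + i_gate * np.tanh(x)
        out.append(state.copy())
    return np.stack(out)  # same shape as seq

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8, 4))      # stand-in for a CNN encoder output
tokens = patchify(fmap, patch=2)           # 16 patches, each of length 2*2*4 = 16
mixed = gated_recurrent_pass(tokens, rng)  # globally mixed patch features
print(tokens.shape, mixed.shape)           # (16, 16) (16, 16)
```

In the real U-VixLSTM, the `mixed` features would then be reshaped back into a spatial volume and passed through the convolutional reconstruction path for upsampling to the segmentation output.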
Problem

Research questions and friction points this paper is trying to address.

Develop efficient medical image segmentation with hybrid CNN-Vision-xLSTM models
Reduce computational costs while maintaining high segmentation performance
Improve global dependency capture in 3D medical image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines CNNs with Vision-xLSTM for segmentation
Uses Vision-xLSTM to capture temporal and global relationships
Upsamples features for efficient medical image segmentation
Pallabi Dutta
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
Soham Bose
Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
S. K. Roy
Department of Computer Science and Engineering, Alipurduar Government Engineering and Management College, West Bengal 736206, India
S. Mitra
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India