AI Summary
This work addresses a central challenge in medical image segmentation: capturing global anatomical structures and fine boundary details at the same time. Existing state space models make this harder because they process images as one-dimensional sequences, which compromises local spatial continuity and high-frequency information. To overcome this, we propose SpectralMamba-UNet, a novel frequency-domain decoupling framework that, for the first time, integrates frequency decomposition into state space modeling. Specifically, a discrete cosine transform separates features into low-frequency (structural) and high-frequency (textural) components; the former is processed by a frequency-domain Mamba module to model global context, while the latter preserves boundary details. A spectral channel reweighting attention mechanism balances the spectral contributions, and a spectrum-guided fusion strategy performs adaptive multi-scale integration in the decoder. Extensive experiments on five public medical image segmentation datasets demonstrate consistent performance gains, validating the method's effectiveness and its generalization across imaging modalities and segmentation targets.
Abstract
Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework that decouples the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies the discrete cosine transform to decompose features into low- and high-frequency components: the low-frequency components support global contextual modeling via a frequency-domain Mamba, while the high-frequency components preserve boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism that forms channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module that achieves adaptive multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.
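To make the frequency decomposition step concrete, the following is a minimal sketch of splitting a single 2D feature map into low- and high-frequency parts with an orthonormal DCT-II. It is not the authors' SDM implementation: the function names `dct_matrix` and `split_frequencies`, the `cutoff` parameter, and the triangular low-pass mask are all assumptions made for illustration, since the abstract does not say how the two bands are delimited.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # rescale the DC row so the matrix is orthogonal
    return mat


def split_frequencies(feature: np.ndarray, cutoff: int):
    """Split a 2D map of shape (H, W) into low- and high-frequency parts.

    `cutoff` is a hypothetical parameter: DCT coefficients whose row index
    plus column index is below `cutoff` are treated as low frequency.
    """
    h, w = feature.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feature @ dw.T  # forward 2D DCT
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    low_mask = (u + v) < cutoff  # assumed triangular low-pass region
    low = dh.T @ (coeffs * low_mask) @ dw  # inverse DCT of the low band
    high = dh.T @ (coeffs * ~low_mask) @ dw  # inverse DCT of the remaining band
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 12))
    low, high = split_frequencies(x, cutoff=4)
    # The two bands partition the spectrum, so they sum back to the input.
    print(np.allclose(low + high, x))
```

Because the mask and its complement partition the coefficients, the two outputs always sum to the input. In the framework described above, the low-frequency part would go to the frequency-domain Mamba and the high-frequency part would carry boundary detail; this sketch covers only the split.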