🤖 AI Summary
To address insufficient multi-scale feature fusion and severe foreground-background class imbalance in retinal vessel segmentation, this paper proposes SA-UNetv2, a lightweight U-Net variant. Methodologically, it introduces cross-scale spatial attention mechanisms across *all* skip connections for adaptive, weighted fusion of multi-level encoder-decoder features, a first in U-Net architectures. Furthermore, it jointly optimizes a weighted binary cross-entropy (WBCE) loss and a Matthews Correlation Coefficient (MCC)-based loss to explicitly mitigate class imbalance. The model achieves state-of-the-art performance on the DRIVE and STARE benchmarks with only 0.26M parameters and a 1.2 MB memory footprint. It processes a single 592×592 image in about one second on CPU, demonstrating high accuracy, strong robustness, and exceptional suitability for edge deployment.
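The combined loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the foreground weight `pos_weight` and the soft (probability-based) MCC formulation are assumptions, since the paper's exact weighting scheme is not given here.

```python
import numpy as np

def wbce_mcc_loss(pred, target, pos_weight=10.0, eps=1e-7):
    """Sketch of a weighted BCE + (1 - soft MCC) loss for binary
    vessel segmentation. `pred` holds probabilities in (0, 1);
    `target` holds binary ground-truth labels."""
    pred = np.clip(pred, eps, 1 - eps)
    # Weighted BCE: up-weight the sparse vessel (foreground) pixels.
    wbce = -np.mean(pos_weight * target * np.log(pred)
                    + (1 - target) * np.log(1 - pred))
    # Soft confusion-matrix terms, so the MCC term is differentiable.
    tp = np.sum(pred * target)
    tn = np.sum((1 - pred) * (1 - target))
    fp = np.sum(pred * (1 - target))
    fn = np.sum((1 - pred) * target)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) + eps)
    # MCC lies in [-1, 1]; (1 - mcc) turns it into a minimizable loss.
    return wbce + (1.0 - mcc)
```

Because MCC accounts for all four confusion-matrix cells, the second term stays informative even when background pixels vastly outnumber vessel pixels, which is exactly the imbalance the summary highlights.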
📝 Abstract
Retinal vessel segmentation is essential for the early diagnosis of diseases such as diabetic retinopathy, hypertension, and neurodegenerative disorders. Although SA-UNet introduces spatial attention in the bottleneck, it underuses attention in the skip connections and does not address the severe foreground-background imbalance. We propose SA-UNetv2, a lightweight model that injects cross-scale spatial attention into all skip connections to strengthen multi-scale feature fusion, and adopts a weighted Binary Cross-Entropy (BCE) plus Matthews Correlation Coefficient (MCC) loss to improve robustness to class imbalance. On the public DRIVE and STARE datasets, SA-UNetv2 achieves state-of-the-art performance with only 1.2 MB memory and 0.26M parameters (under 50% of SA-UNet's), and one-second CPU inference on 592 × 592 × 3 images, demonstrating strong efficiency and deployability in resource-constrained, CPU-only settings.
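The spatial attention injected into the skip connections can be illustrated with a minimal NumPy sketch. In SA-UNet-style modules the channel-pooled maps are passed through a learned 7×7 convolution; here a fixed sum with a scalar gain `gate_w` stands in for that learned projection, so this is an assumption-laden illustration of the mechanism, not the paper's module.

```python
import numpy as np

def spatial_attention(feat, gate_w=1.0):
    """Sketch of spatial attention over a (C, H, W) feature map:
    pool across channels, form a per-pixel gate in (0, 1), and
    reweight every channel by that gate."""
    avg = feat.mean(axis=0)   # (H, W) channel-average pooling
    mx = feat.max(axis=0)     # (H, W) channel-max pooling
    # Stand-in for the learned conv over [avg; max] (assumption).
    gate = 1.0 / (1.0 + np.exp(-gate_w * (avg + mx)))  # sigmoid gate
    return feat * gate[None, :, :]  # broadcast gate over channels
```

Applied on every skip connection, such a gate lets the decoder emphasize thin-vessel regions in the encoder features before fusion, which is the multi-scale behavior the abstract targets.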