🤖 AI Summary
This work addresses two challenges in urban green-space extraction from ultra-high-resolution remote sensing imagery: limited semantic reuse across spatially separated regions, and the confounding of visual appearance with vegetation confidence that arises when NDVI is coupled with RGB features. We propose a SegFormer-based framework with a global memory bank that stores high-confidence vegetation prototypes. NDVI is decoupled from the backbone and used as a physics-guided gating signal that regulates memory writing, while memory-mediated cross-attention enables cross-regional semantic reuse. A momentum update mechanism combined with a boundary-aware fusion strategy improves recognition of spatially scattered yet spectrally similar vegetation while keeping the backbone RGB-only. On a self-collected Chengdu dataset and two reduced-label settings derived from ISPRS Potsdam, our method achieves mIoU/mDice scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving on the SegFormer-B4 baseline in each setting.
📝 Abstract
Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During both training and inference, the current patch queries the stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 × 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving on the SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, memory capacity, and momentum jointly shape the final performance.
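The gating, momentum-update, and retrieval steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the NDVI threshold of 0.4, the momentum of 0.9, the nearest-prototype assignment rule, and the plain scaled dot-product attention are all assumptions chosen for illustration. Only the standard NDVI definition, (NIR - Red) / (NIR + Red), is taken as given.

```python
import numpy as np


def ndvi(nir, red, eps=1e-6):
    """Standard NDVI: (NIR - Red) / (NIR + Red); eps avoids division by zero."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)


def gated_memory_update(memory, feats, ndvi_vals, threshold=0.4, momentum=0.9):
    """Write NDVI-admitted descriptors into the memory bank with a momentum update.

    memory:    (K, D) prototype bank
    feats:     (N, D) RGB-derived descriptors (NDVI never enters these features)
    ndvi_vals: (N,)   per-descriptor NDVI, used only as an admission gate
    threshold, momentum: hypothetical values, not taken from the paper
    """
    admitted = feats[np.asarray(ndvi_vals) > threshold]
    if admitted.shape[0] == 0:
        return memory.copy()  # nothing passes the gate, memory is unchanged

    # Assumed rule: assign each admitted descriptor to its most similar prototype.
    m_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    a_norm = admitted / (np.linalg.norm(admitted, axis=1, keepdims=True) + 1e-8)
    assign = np.argmax(a_norm @ m_norm.T, axis=1)

    updated = memory.copy()
    for k in np.unique(assign):
        mean_k = admitted[assign == k].mean(axis=0)
        # Exponential moving average: old prototype dominates, new evidence nudges it.
        updated[k] = momentum * memory[k] + (1.0 - momentum) * mean_k
    return updated


def memory_cross_attention(queries, memory):
    """Patch features (N, D) attend over stored prototypes (K, D); returns (N, D)."""
    d = queries.shape[1]
    scores = queries @ memory.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ memory
```

Because the memory holds a fixed number K of prototypes, the retrieval cost per patch grows with N × K rather than with the size of the full image, which is one plausible reading of the "bounded overhead" mentioned above. How the retrieved response is fused back into the decoder features is not specified in the abstract and is left out of the sketch.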