🤖 AI Summary
This work addresses the limitations of existing unsupervised building change detection methods, which typically rely on generic temporal differences. Such differences are sensitive to appearance variations, registration errors, and non-building changes, so they often fail to capture genuine structural changes. To overcome this, the authors propose SST-CD, a framework that reformulates the task as end-to-end detector learning under noisy pseudo-supervision, with the detector trained only on spatially reliable pixels. Reliable pseudo-labels are selected via a local consistency criterion, and training is stabilized by a lightweight feature adapter coupled with a prototype-based decoder. The method achieves F1 scores of 83.08%, 91.69%, and 86.60% on the LEVIR-CD, WHU-CD, and DSIFN-CD benchmarks, respectively, outperforming existing unsupervised and label-free baselines.
📝 Abstract
Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision.
SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines. Code will be made publicly available.
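
The spatially selective supervision described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the exact consistency criterion, so everything here is an assumption made for illustration. That includes the function names (`candidate_pseudo_labels`, `reliable_pixel_mask`, `masked_bce`), the use of an L2 feature distance as the temporal discrepancy, the 3×3 window, and the 0.8 agreement threshold. The sketch treats a pixel as reliable when most of its neighborhood shares its candidate label, and computes the loss on reliable pixels only. It requires NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def candidate_pseudo_labels(feat_t1, feat_t2, threshold):
    """Binarize the per-pixel feature distance into candidate labels (1 = change).

    feat_t1, feat_t2: arrays of shape (H, W, C). The L2 distance and the fixed
    threshold are illustrative assumptions, not taken from the paper.
    """
    discrepancy = np.linalg.norm(feat_t1 - feat_t2, axis=-1)
    return (discrepancy > threshold).astype(np.uint8)


def reliable_pixel_mask(pseudo, window=3, min_agreement=0.8):
    """Return a boolean (H, W) mask of pixels whose neighborhood agrees with them.

    A pixel counts as reliable when at least `min_agreement` of the pixels in
    its window (the pixel itself included) carry the same candidate label.
    """
    pad = window // 2
    padded = np.pad(pseudo, pad, mode="edge")
    patches = sliding_window_view(padded, (window, window))  # (H, W, window, window)
    change_fraction = patches.mean(axis=(-2, -1))
    agreement = np.where(pseudo == 1, change_fraction, 1.0 - change_fraction)
    return agreement >= min_agreement


def masked_bce(prob, pseudo, reliable, eps=1e-7):
    """Binary cross-entropy averaged over reliable pixels only."""
    prob = np.clip(prob, eps, 1.0 - eps)
    target = pseudo.astype(np.float64)
    loss = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    n_reliable = reliable.sum()
    if n_reliable == 0:
        return 0.0  # nothing trustworthy to supervise on
    return float((loss * reliable).sum() / n_reliable)
```

In this sketch, isolated responses and the ragged borders of change regions fall below the agreement threshold and contribute nothing to the loss, which is one plausible way to keep registration errors and speckle-like discrepancies out of the supervision signal.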