🤖 AI Summary
This study addresses two obstacles in remote sensing semantic segmentation: the cost of pixel-level annotation and poor generalization across sensors, platforms, and geographic regions. The authors propose a compact human-in-the-loop framework that requires only sparse expert clicks on regions where the model is confidently wrong. An error-weighted loss applies these clicks directly to model optimization, without pseudo-labeling, propagation algorithms, or other auxiliary mechanisms. The reported results show sparse click supervision nearly matching full supervision: 74.79% mIoU (97.2% of the fully supervised result) on BsB Aerial with just 0.040% of pixels annotated, and 76.78% mIoU on ISPRS Vaihingen with only 0.011% of pixels annotated, which matches the fully supervised baseline and surpasses existing methods. The annotation record itself serves as an extensible, correctable, and auditable dataset.
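The click-targeting idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function name `pick_clicks` is hypothetical, the toy arrays are made up, and the `truth` grid stands in for the expert's judgment, which in a real session comes from a person looking at the image rather than from stored labels. For each class, it selects at most one pixel, namely the one where the model is wrong with the highest confidence.

```python
def pick_clicks(confidence, predicted, truth):
    """Return {true_class: (row, col)} with at most one click per class.

    For each true class, the click lands on the pixel where the model is
    wrong with the highest confidence. Classes without errors get no click.
    `truth` simulates the expert (an assumption made for this sketch).
    """
    best = {}  # true class -> (confidence, (row, col))
    for r, row in enumerate(truth):
        for c, true_cls in enumerate(row):
            if predicted[r][c] == true_cls:
                continue  # correct pixel: nothing for the expert to fix
            conf = confidence[r][c]
            if true_cls not in best or conf > best[true_cls][0]:
                best[true_cls] = (conf, (r, c))
    return {cls: pos for cls, (_, pos) in best.items()}


# Toy 2x3 frame with classes 0 and 1 (hypothetical values).
truth = [[0, 0, 1], [1, 1, 0]]
predicted = [[0, 1, 1], [0, 1, 1]]
confidence = [[0.9, 0.95, 0.8], [0.6, 0.7, 0.99]]

clicks = pick_clicks(confidence, predicted, truth)
print(clicks)
```

In this toy frame, class 0 has two misclassified pixels and the click goes to the more confident of them, while class 1 has a single error that receives its click.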
📝 Abstract
Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset because models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all of which operate on the model's predictive distribution. By construction, a confidently wrong pixel is indistinguishable from a confidently correct one in that distribution, so no rule that reads it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, rather than arbitrary pixels, suffice to match dense supervision with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset: extensible, correctable, and auditable. Experiments use a minimum-effort regime of at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense-supervision performance (74.79% mIoU with 0.040% of pixels labeled) and shows contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (an external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels labeled, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets of 1x to 100x, pseudo-labels across thresholds of 0.90 to 0.99, CRF-based propagation, and uniform random sampling) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework that operates without auxiliary machinery.
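The abstract does not give the form of the error-weighted loss, so the following is a minimal sketch of one plausible reading, not the paper's definition. It assumes a cross-entropy computed only at clicked pixels, with each click weighted by `1 + alpha * (1 - p_true)`, so that a more confidently wrong prediction contributes a larger gradient. The weighting formula, the `alpha` gain, and the names `softmax` and `error_weighted_loss` are all assumptions introduced for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def error_weighted_loss(logits_map, clicks, alpha=4.0):
    """Mean weighted cross-entropy over clicked pixels only.

    logits_map: dict (row, col) -> list of class logits
    clicks:     dict (row, col) -> class label given by the expert
    alpha:      hypothetical gain; alpha=0 reduces to plain cross-entropy
    Unclicked pixels contribute nothing (no pseudo-labels, no propagation).
    """
    if not clicks:
        return 0.0
    total = 0.0
    for pos, label in clicks.items():
        p_true = softmax(logits_map[pos])[label]
        # Assumed weight: grows with the probability mass on wrong classes.
        weight = 1.0 + alpha * (1.0 - p_true)
        total += -weight * math.log(p_true)
    return total / len(clicks)


# Two pixels whose correct class is 1 (hypothetical logits).
logits_map = {
    (0, 0): [4.0, 0.0],  # confidently wrong: almost all mass on class 0
    (0, 1): [0.5, 0.0],  # mildly wrong
}
loss_confident = error_weighted_loss(logits_map, {(0, 0): 1})
loss_mild = error_weighted_loss(logits_map, {(0, 1): 1})
print(loss_confident > loss_mild)
```

Under these assumptions, the confidently wrong pixel yields a much larger loss than the mildly wrong one, which is the behavior the phrase "amplifies the gradient at each click" suggests. Other weighting schemes would be equally consistent with the abstract.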