🤖 AI Summary
This work addresses building footprint extraction from high-resolution remote sensing imagery, where complex structures and varying imaging conditions often degrade performance. Conventional approaches typically rely on pixel-level segmentation followed by multi-stage post-processing, which is computationally costly and prone to error propagation. To overcome these limitations, the authors propose PolyBuild, an end-to-end model that directly generates vectorized building polygons without any post-processing. PolyBuild uses a hybrid CNN-Transformer architecture: an Initial Contour Generation Module (ICGM) jointly performs detection and coarse contour extraction, and a Contour Optimization Module (COM) refines the contour in a Transformer-based decoder that combines CNN features with contour positional information, fusing local detail with global context. Experiments on three building datasets show that PolyBuild significantly outperforms state-of-the-art mask-based and contour-based methods.
📝 Abstract
Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, varying imaging conditions and complex building structures make automatic contour extraction extremely challenging. Mainstream approaches often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contours, which can be computationally intensive and error-prone. In this paper, we propose an end-to-end method named PolyBuild, which directly extracts building vector polygons from high-resolution remote sensing images without any post-processing operations. The proposed method comprises two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM performs object detection and initial contour extraction simultaneously: it generates a bounding box for each building instance and represents the instance by the concatenated center features of four sub-regions, from which the initial contour is produced. The COM then refines the contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. This hybrid CNN-Transformer architecture captures both local and global spatial relationships within the building contour, supporting high-quality boundary delineation. Extensive experiments on three building datasets show that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.
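The two-stage flow described above (an initial contour derived from a detected box, followed by iterative vertex refinement) can be illustrated with a minimal geometric sketch. This is not the PolyBuild implementation. The abstract does not specify how the four sub-regions are laid out, how many vertices are used, or how offsets are predicted, so the 2x2 quadrant split, the uniform perimeter sampling, the function names, and the `predict_offsets` callable (standing in for the Transformer-based decoder) are all assumptions made for illustration.

```python
import numpy as np


def sub_region_centers(box):
    """Centers of four sub-regions of a box (x_min, y_min, x_max, y_max).

    Assumption: the sub-regions are the four quadrants of a 2x2 split.
    In a real model, features would be sampled at these locations.
    """
    x0, y0, x1, y1 = box
    xs = [x0 + 0.25 * (x1 - x0), x0 + 0.75 * (x1 - x0)]
    ys = [y0 + 0.25 * (y1 - y0), y0 + 0.75 * (y1 - y0)]
    return np.array([[x, y] for y in ys for x in xs], dtype=float)


def box_to_initial_contour(box, num_points=64):
    """Sample num_points vertices uniformly along the box perimeter.

    Assumption: a stand-in for a learned initial contour. The box must
    have non-zero width and height.
    """
    x0, y0, x1, y1 = box
    corners = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], dtype=float)
    nxt = np.roll(corners, -1, axis=0)
    edge_len = np.linalg.norm(nxt - corners, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(edge_len)])
    # Arc-length position of each vertex, then the edge it falls on.
    s = np.linspace(0.0, edge_len.sum(), num_points, endpoint=False)
    edge_idx = np.searchsorted(cum, s, side="right") - 1
    t = (s - cum[edge_idx]) / edge_len[edge_idx]
    return corners[edge_idx] + t[:, None] * (nxt[edge_idx] - corners[edge_idx])


def refine_contour(contour, predict_offsets, num_iters=3):
    """Iteratively move vertices by predicted per-vertex offsets.

    predict_offsets is a hypothetical callable mapping an (N, 2) contour
    to (N, 2) offsets; in the paper this role is played by the decoder.
    """
    for _ in range(num_iters):
        contour = contour + predict_offsets(contour)
    return contour
```

In this sketch, each refinement step consumes the current vertex positions, mirroring the abstract's description of iteratively combining image features with contour positional information. A toy `predict_offsets`, such as one that moves each vertex halfway toward a known target polygon, is enough to exercise the loop.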