🤖 AI Summary
This work addresses the scarcity of pixel-level annotations in industrial inspection and the systematic noise in pseudo-masks generated from bounding boxes by the Segment Anything Model (SAM), which often produce false positives on background regions or miss sparse defects. To tackle this, the authors propose a noise-robust box-to-pixel distillation framework that treats SAM as a noisy teacher model to generate offline pseudo-masks and trains a lightweight student model for weakly supervised defect segmentation. The approach incorporates a hierarchical decoder with an auxiliary binary localization head to decouple foreground discovery from classification and introduces a unidirectional online self-correction mechanism to mitigate the teacher’s false negatives. Evaluated on a manually annotated wind turbine inspection benchmark, the method improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Online self-correction raises binary recall by +18.56, and the model uses 80% fewer trainable parameters.
📝 Abstract
Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects.
To address this limitation, Boxes2Pixels, a noise-robust box-to-pixel distillation framework, is proposed. It treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives.
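The one-sided self-correction idea in (iii) can be illustrated with a minimal sketch. The abstract does not specify the exact rule, so everything here is an assumption made for illustration: the function names `one_sided_relaxation` and `weighted_bce`, the threshold `tau`, and the choice to zero out the loss weight (rather than, say, flip the label). The only property taken from the text is that relaxation applies to teacher-background pixels where the student is confident, while teacher-foreground pixels are never altered.

```python
import math

def one_sided_relaxation(teacher_mask, student_fg_prob, tau=0.9):
    """Return per-pixel loss weights (hypothetical helper, not the authors' code).

    teacher_mask: 0/1 pseudo-labels from the teacher (1 = defect).
    student_fg_prob: student's predicted foreground probability per pixel.
    tau: assumed confidence threshold for relaxing background supervision.
    """
    weights = []
    for label, prob in zip(teacher_mask, student_fg_prob):
        if label == 0 and prob >= tau:
            # Teacher says background, student is confident it is foreground:
            # drop the pixel from the loss (possible teacher false negative).
            weights.append(0.0)
        else:
            # One-sided: foreground labels and unconfident background stay supervised.
            weights.append(1.0)
    return weights

def weighted_bce(teacher_mask, student_fg_prob, weights, eps=1e-7):
    """Weighted binary cross-entropy averaged over pixels with non-zero weight."""
    total, norm = 0.0, 0.0
    for y, p, w in zip(teacher_mask, student_fg_prob, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += w * -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        norm += w
    return total / norm if norm > 0 else 0.0

# Toy example: the second pixel is labeled background, but the student is confident.
teacher = [0, 0, 1, 0]
student = [0.05, 0.95, 0.80, 0.40]
weights = one_sided_relaxation(teacher, student)
loss = weighted_bce(teacher, student, weights)
```

In this toy example only the second pixel is relaxed. The third pixel keeps full supervision even though the student is confident there, because the rule never touches teacher-foreground labels.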
On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.
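For readers unfamiliar with the reported metrics, the sketch below shows the standard definitions of binary IoU and recall on flattened 0/1 masks. It is not the paper's evaluation code: the function name `binary_iou_and_recall` is hypothetical, and returning 1.0 for empty denominators is an assumed convention. The benchmark's aggregation (per image or over the whole dataset) is not stated in the abstract.

```python
def binary_iou_and_recall(pred, target):
    """Compute binary IoU and recall for flattened 0/1 masks (illustrative helper)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    # IoU = TP / (TP + FP + FN); assumed to be 1.0 when both masks are empty.
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 1.0
    # Recall = TP / (TP + FN); assumed to be 1.0 when there is no foreground.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    return iou, recall

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
iou, recall = binary_iou_and_recall(pred, target)
```

Here there are two true positives, one false positive, and one false negative, so the IoU is 2/4 and the recall is 2/3. Missed defects lower both metrics, which is why recovering teacher false negatives can raise recall.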