🤖 AI Summary
This work addresses the challenge in high-speed waste sorting where high-resolution RGB images lack spectral discriminability and low-resolution hyperspectral images (HSI) lack spatial detail. To overcome this, the authors propose a Bidirectional Cross-Attention Fusion (BCAF) mechanism that aligns RGB and HSI modalities directly on their original grids, enabling fusion at native resolutions without the spectral degradation induced by pre-upsampling. The framework employs a Swin Transformer for RGB processing and introduces a 3D tokenized Swin architecture that preserves spectral self-attention in HSI, enabling efficient, complementary multimodal fusion. Experiments demonstrate state-of-the-art performance with 76.4% mIoU at 31 FPS on SpectralWaste and, on the newly introduced industrial dataset K3I-Cycling, 62.3% and 66.2% mIoU for material and plastic-subcategory segmentation, respectively, while also showing generalizability to other low-resolution multi-channel sensors.
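The "3D tokenized Swin architecture" mentioned above embeds hyperspectral cubes with a spectral axis kept intact rather than folding all bands into feature channels. A minimal sketch of such a 3D patch tokenizer (not the authors' code; the patch size, embedding dimension, and class name `HSITokenizer3D` are illustrative assumptions) could look like:

```python
import torch
import torch.nn as nn

class HSITokenizer3D(nn.Module):
    """Hypothetical sketch of 3D patch embedding for HSI: the spectral
    dimension is treated as a depth axis and downsampled by patching,
    so later attention layers can still operate along it."""

    def __init__(self, patch=(4, 4, 4), embed_dim=48):
        super().__init__()
        # Single input channel; the conv's depth dimension spans spectral bands.
        self.proj = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        # x: (B, bands, H, W) hyperspectral cube
        x = x.unsqueeze(1)   # (B, 1, bands, H, W)
        return self.proj(x)  # (B, embed_dim, bands/p, H/p, W/p): spectral axis preserved
```

Because the output retains a (reduced) spectral axis, self-attention can be applied along it, in contrast to a 2D tokenizer that would collapse all bands at the first layer.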
📝 Abstract
Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling and early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate BCAF on a novel industrial dataset, K3I-Cycling (its first RGB subset has already been released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).
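The core idea of fusing the two modalities at their native grids can be illustrated with a minimal bidirectional cross-attention block. This is a sketch under stated assumptions, not the paper's implementation: it uses global rather than the paper's localized (windowed) attention for brevity, and the dimensions, class name, and residual wiring are illustrative choices. The key property it shows is that each modality keeps its own token count (i.e. its native resolution) while attending to the other:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hypothetical sketch: RGB tokens query HSI tokens and vice versa.
    Neither token sequence is resampled, so each modality stays on its
    native grid. (Global attention here; the paper uses localized attention.)"""

    def __init__(self, rgb_dim, hsi_dim, embed_dim=64, num_heads=4):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.rgb_proj = nn.Linear(rgb_dim, embed_dim)
        self.hsi_proj = nn.Linear(hsi_dim, embed_dim)
        self.rgb_from_hsi = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.hsi_from_rgb = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, rgb_tokens, hsi_tokens):
        # rgb_tokens: (B, N_rgb, rgb_dim) from the high-resolution grid
        # hsi_tokens: (B, N_hsi, hsi_dim) from the low-resolution grid
        r = self.rgb_proj(rgb_tokens)
        h = self.hsi_proj(hsi_tokens)
        # RGB queries attend to HSI keys/values, and vice versa.
        r_fused, _ = self.rgb_from_hsi(r, h, h)
        h_fused, _ = self.hsi_from_rgb(h, r, r)
        # Residual connections keep each modality's own features.
        return r + r_fused, h + h_fused
```

Note that the fused RGB sequence still has `N_rgb` tokens and the fused HSI sequence `N_hsi` tokens, which is what avoids pre-upsampling the hyperspectral input.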