🤖 AI Summary
This work addresses the challenge in high-speed waste sorting where high-resolution RGB images lack spectral discriminability and low-resolution hyperspectral images (HSI) lack spatial detail. To overcome this, the authors propose a Bidirectional Cross-Attention Fusion (BCAF) mechanism that aligns RGB and HSI modalities directly on their original grids, enabling fusion at native resolutions without the spectral degradation induced by pre-upsampling. The framework employs a Swin Transformer for RGB processing and introduces a 3D tokenized Swin architecture that preserves spectral self-attention in HSI, enabling efficient, complementary multimodal fusion. Experiments demonstrate state-of-the-art performance with 76.4% mIoU at 31 FPS on SpectralWaste and, on the newly introduced industrial dataset K3I-Cycling, 62.3% and 66.2% mIoU for material and plastic-subcategory segmentation, respectively, while also showing generalizability to other low-resolution multi-channel sensors.
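The "3D tokenized Swin architecture" mentioned above embeds hyperspectral cubes with a spectral axis kept intact rather than folding all bands into feature channels. A minimal sketch of such a 3D patch tokenizer (not the authors' code; the patch size, embedding dimension, and class name `HSITokenizer3D` are illustrative assumptions) could look like:

```python
import torch
import torch.nn as nn

class HSITokenizer3D(nn.Module):
    """Hypothetical sketch of 3D patch embedding for HSI: the spectral
    dimension is treated as a depth axis and downsampled by patching,
    so later attention layers can still operate along it."""

    def __init__(self, patch=(4, 4, 4), embed_dim=48):
        super().__init__()
        # Single input channel; the conv's depth dimension spans spectral bands.
        self.proj = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        # x: (B, bands, H, W) hyperspectral cube
        x = x.unsqueeze(1)   # (B, 1, bands, H, W)
        return self.proj(x)  # (B, embed_dim, bands/p, H/p, W/p): spectral axis preserved
```

Because the output retains a (reduced) spectral axis, self-attention can be applied along it, in contrast to a 2D tokenizer that would collapse all bands at the first layer.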
📝 Abstract
Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling and early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate BCAF on a novel industrial dataset, K3I-Cycling (its first RGB subset has already been released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).
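The core idea of fusing the two modalities at their native grids can be illustrated with a minimal bidirectional cross-attention block. This is a sketch under stated assumptions, not the paper's implementation: it uses global rather than the paper's localized (windowed) attention for brevity, and the dimensions, class name, and residual wiring are illustrative choices. The key property it shows is that each modality keeps its own token count (i.e. its native resolution) while attending to the other:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hypothetical sketch: RGB tokens query HSI tokens and vice versa.
    Neither token sequence is resampled, so each modality stays on its
    native grid. (Global attention here; the paper uses localized attention.)"""

    def __init__(self, rgb_dim, hsi_dim, embed_dim=64, num_heads=4):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.rgb_proj = nn.Linear(rgb_dim, embed_dim)
        self.hsi_proj = nn.Linear(hsi_dim, embed_dim)
        self.rgb_from_hsi = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.hsi_from_rgb = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, rgb_tokens, hsi_tokens):
        # rgb_tokens: (B, N_rgb, rgb_dim) from the high-resolution grid
        # hsi_tokens: (B, N_hsi, hsi_dim) from the low-resolution grid
        r = self.rgb_proj(rgb_tokens)
        h = self.hsi_proj(hsi_tokens)
        # RGB queries attend to HSI keys/values, and vice versa.
        r_fused, _ = self.rgb_from_hsi(r, h, h)
        h_fused, _ = self.hsi_from_rgb(h, r, r)
        # Residual connections keep each modality's own features.
        return r + r_fused, h + h_fused
```

Note that the fused RGB sequence still has `N_rgb` tokens and the fused HSI sequence `N_hsi` tokens, which is what avoids pre-upsampling the hyperspectral input.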