Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting

📅 2026-03-14
🤖 AI Summary
This work addresses a core trade-off in high-speed waste sorting: high-resolution RGB images lack spectral discriminability, while low-resolution hyperspectral images (HSI) lack spatial detail. The authors propose Bidirectional Cross-Attention Fusion (BCAF), which aligns the two modalities on their native grids through localized, bidirectional cross-attention, so fusion needs neither pre-upsampling of the HSI nor early spectral collapse. The framework pairs a standard Swin Transformer for RGB with an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization and spectral self-attention. On SpectralWaste, BCAF reaches a state-of-the-art 76.4% mIoU at 31 images/s (75.4% at 55 images/s); on the newly introduced industrial dataset K3I-Cycling, it reaches 62.3% mIoU for material segmentation and 66.2% mIoU for plastic-type segmentation. Although the evaluation targets RGB-HSI fusion, the authors describe BCAF as modality-agnostic and applicable to co-registered RGB paired with other lower-resolution, high-channel sensors.
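The results above are reported in mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch of the standard definition computed from a confusion matrix. The function name `mean_iou`, the `ignore_index` option, and the choice to average only over classes that occur are assumptions for illustration; the paper's exact evaluation protocol (for example, how background is handled) is not stated in this summary.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: integer label maps of identical shape.
    ignore_index: optional ground-truth label to exclude (assumption).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_index is not None:
        keep = target != ignore_index
        pred, target = pred[keep], target[keep]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())

if __name__ == "__main__":
    target = [0, 0, 1, 1]
    pred = [0, 1, 1, 1]
    print(mean_iou(pred, target, num_classes=2))
```

Because every class contributes equally to the average, rare material classes weigh as much as frequent ones, which is why mIoU is a common choice for imbalanced segmentation data.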

📝 Abstract
Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).
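To make the idea of localized, bidirectional cross-attention at native grids concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the RGB feature grid is an integer multiple of the HSI grid, pairs each window of HSI tokens with the RGB window covering the same area, and lets each modality query the other within that window. The function names, the single-head formulation, the window size `hsi_win`, the residual update, and the random projection matrices are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(feat, win):
    """Split an (H, W, C) map into (num_windows, win*win, C) token groups."""
    H, W, C = feat.shape
    x = feat.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_merge(tokens, win, H, W):
    """Inverse of window_partition."""
    C = tokens.shape[-1]
    x = tokens.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def cross_attend(q_tokens, kv_tokens, Wq, Wk, Wv):
    """Single-head attention: queries from one modality, keys/values from the other."""
    q = q_tokens @ Wq
    k = kv_tokens @ Wk
    v = kv_tokens @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def bidirectional_fusion(rgb, hsi, params, hsi_win=2):
    """Fuse (H, W, Cr) RGB features with (h, w, Ch) HSI features on their own grids."""
    H, W, _ = rgb.shape
    h, w, _ = hsi.shape
    scale = H // h
    assert H == h * scale and W == w * scale, "grids must be integer multiples"
    assert h % hsi_win == 0 and w % hsi_win == 0
    rgb_win = hsi_win * scale  # RGB window covering the same area as the HSI window

    rgb_t = window_partition(rgb, rgb_win)
    hsi_t = window_partition(hsi, hsi_win)

    # Each modality keeps its own resolution; neither map is resampled.
    rgb_new = rgb_t + cross_attend(rgb_t, hsi_t, *params["rgb_from_hsi"])
    hsi_new = hsi_t + cross_attend(hsi_t, rgb_t, *params["hsi_from_rgb"])
    return window_merge(rgb_new, rgb_win, H, W), window_merge(hsi_new, hsi_win, h, w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Cr, Ch, d = 8, 16, 4
    params = {
        "rgb_from_hsi": (rng.normal(size=(Cr, d)), rng.normal(size=(Ch, d)), rng.normal(size=(Ch, Cr))),
        "hsi_from_rgb": (rng.normal(size=(Ch, d)), rng.normal(size=(Cr, d)), rng.normal(size=(Cr, Ch))),
    }
    rgb = rng.normal(size=(16, 16, Cr))
    hsi = rng.normal(size=(4, 4, Ch))
    rgb_out, hsi_out = bidirectional_fusion(rgb, hsi, params)
    print(rgb_out.shape, hsi_out.shape)
```

The point of the sketch is that both outputs keep the shapes of their inputs: the RGB stream gains spectral context and the HSI stream gains spatial context, without upsampling the HSI cube or collapsing its channels before fusion.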
Problem

Research questions and friction points this paper is trying to address.

waste sorting
RGB-HSI fusion
multimodal segmentation
hyperspectral imaging
material identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Cross-Attention
RGB-HSI Fusion
Swin Transformer
3D Tokenization
Multimodal Segmentation
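
The 3D tokenization listed above can be illustrated with a short NumPy sketch. It is a hedged example of the general idea, not the paper's code: a hyperspectral cube is cut into patches along both spatial axes and the band axis, so each spatial location yields several tokens (one per spectral slice) that spectral self-attention could later relate. A plain 2D patching is shown for contrast, where all bands fold into a single token. The function names and the patch sizes `patch_hw` and `patch_b` are assumptions.

```python
import numpy as np

def tokenize_3d(cube, patch_hw=2, patch_b=8):
    """Cut an (H, W, B) cube into tokens of shape (H/p, W/p, B/pb, p*p*pb).

    The third axis indexes spectral slices, so spectral order is kept
    as a token dimension instead of being flattened away.
    """
    H, W, B = cube.shape
    p, pb = patch_hw, patch_b
    assert H % p == 0 and W % p == 0 and B % pb == 0
    x = cube.reshape(H // p, p, W // p, p, B // pb, pb)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(H // p, W // p, B // pb, p * p * pb)

def tokenize_2d(cube, patch_hw=2):
    """Contrast case: one token per spatial patch, with every band folded in."""
    H, W, B = cube.shape
    p = patch_hw
    assert H % p == 0 and W % p == 0
    x = cube.reshape(H // p, p, W // p, p, B).transpose(0, 2, 1, 3, 4)
    return x.reshape(H // p, W // p, p * p * B)

if __name__ == "__main__":
    cube = np.arange(8 * 8 * 32, dtype=np.float32).reshape(8, 8, 32)
    print(tokenize_3d(cube).shape)  # 4 spectral slices per spatial patch
    print(tokenize_2d(cube).shape)  # a single token per spatial patch
```

In this sketch the number of spectral slices is `B // patch_b`, which is the kind of knob the abstract refers to when it discusses the trade-off between RGB input resolution and the number of HSI spectral slices.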
Jonas V. Funk
KIT, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany; Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, 76131 Karlsruhe, Germany
Lukas Roming
Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, 76131 Karlsruhe, Germany
Andreas Michel
Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, 76131 Karlsruhe, Germany
Paul Bäcker
Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, 76131 Karlsruhe, Germany
Georg Maier
Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung (IOSB)
Image Processing, Machine Vision, Sensor-based Sorting
Thomas Längle
KIT, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany; Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, 76131 Karlsruhe, Germany
Markus Klute
Assistant Professor of Physics, Massachusetts Institute of Technology
Particle Physics