🤖 AI Summary
To address a key limitation of multimodal scene understanding under low-visibility conditions, namely the scarcity of short-wave infrared (SWIR) data, this paper proposes a trimodal (RGB + LWIR + synthetic SWIR) fusion method that requires no real SWIR imagery. The core innovations are modality-specific encoders coupled with a softmax-gated fusion head, together with a contrast-enhancement-driven LWIR-to-SWIR synthesis mechanism that yields synthetic SWIR representations with high structural fidelity and improved material discriminability. Under a unified evaluation protocol, the method achieves significant improvements across multiple public and private benchmarks (+12.3% contrast, +9.7% structural similarity (SSIM), and enhanced edge sharpness) while maintaining real-time inference. Comprehensive experiments demonstrate consistent gains over state-of-the-art dual- and trimodal baselines.
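For intuition, here is a minimal PyTorch sketch of what a softmax-gated trimodal fusion head could look like. The class name, layer sizes, and gating design are illustrative assumptions, not the paper's actual architecture: a 1x1 convolution predicts per-pixel softmax weights that convexly mix the three per-modality feature maps.

```python
import torch
import torch.nn as nn

class GatedTrimodalFusion(nn.Module):
    """Hypothetical softmax-gated fusion of per-modality feature maps.

    Each modality (RGB, LWIR, synthetic SWIR) is encoded separately;
    a small gating network predicts per-pixel softmax weights that
    mix the three feature maps into one fused map.
    """
    def __init__(self, channels: int, num_modalities: int = 3):
        super().__init__()
        # 1x1 conv maps concatenated features to one gate logit per modality.
        self.gate = nn.Conv2d(channels * num_modalities, num_modalities,
                              kernel_size=1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, C, H, W) tensors, one per modality.
        stacked = torch.stack(feats, dim=1)           # (B, M, C, H, W)
        logits = self.gate(torch.cat(feats, dim=1))   # (B, M, H, W)
        weights = torch.softmax(logits, dim=1)        # convex weights per pixel
        return (stacked * weights.unsqueeze(2)).sum(dim=1)  # (B, C, H, W)

# Usage: fuse three 64-channel feature maps from modality-specific encoders.
fusion = GatedTrimodalFusion(channels=64)
rgb_f, lwir_f, swir_f = (torch.randn(1, 64, 128, 160) for _ in range(3))
fused = fusion([rgb_f, lwir_f, swir_f])  # (1, 64, 128, 160)
```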
📝 Abstract
Enhancing scene understanding in adverse visibility conditions remains a critical challenge for surveillance and autonomous navigation systems. Conventional imaging modalities such as RGB and thermal infrared (MWIR/LWIR), even when fused, often fail to deliver comprehensive scene information, particularly under atmospheric interference or inadequate illumination. Short-Wave Infrared (SWIR) imaging has emerged as a promising complementary modality owing to its ability to penetrate atmospheric disturbances and to differentiate materials with improved clarity. However, the advancement and widespread deployment of SWIR-based systems face a significant hurdle: the scarcity of publicly accessible SWIR datasets. In response to this challenge, our research introduces an approach that synthetically generates SWIR-like images from existing LWIR data using advanced contrast enhancement techniques, reproducing SWIR's structural and contrast cues without claiming spectral reproduction. We then propose a multimodal fusion framework integrating synthetic SWIR, LWIR, and RGB modalities, built on an optimized encoder-decoder neural network with modality-specific encoders and a softmax-gated fusion head. Comprehensive experiments on public RGB-LWIR benchmarks (M3FD, TNO, CAMEL, MSRS, RoadScene) and an additional private real RGB-MWIR-SWIR dataset demonstrate that our synthetic-SWIR-enhanced fusion framework improves fused-image quality (contrast, edge definition, structural fidelity) while maintaining real-time performance. We also evaluate fair trimodal baselines (LP, LatLRR, GFF) and cascaded trimodal variants of U2Fusion and SwinFusion under a unified protocol. These outcomes highlight substantial potential for real-world applications in surveillance and autonomous systems.
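As a rough illustration of the contrast-enhancement idea, the sketch below turns a single-channel LWIR frame into a SWIR-like image. The paper's actual pipeline is not detailed in this abstract, so every operation and parameter here (CLAHE for local contrast, unsharp masking for edge definition) is an assumption standing in for the method; it mimics SWIR's structural and contrast character, not its spectral response.

```python
import cv2
import numpy as np

def synthesize_swir_like(lwir: np.ndarray) -> np.ndarray:
    """Hypothetical SWIR-like synthesis from a single-channel LWIR frame.

    Stand-in for the paper's contrast-enhancement pipeline: CLAHE boosts
    local contrast, unsharp masking sharpens edges. Output is uint8.
    """
    # Normalize arbitrary-range thermal data (e.g., 16-bit) to 8-bit.
    lwir8 = cv2.normalize(lwir, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # Local contrast enhancement.
    clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(lwir8)
    # Unsharp masking to emphasize edges, a hallmark of SWIR imagery.
    blurred = cv2.GaussianBlur(enhanced, (0, 0), sigmaX=2.0)
    return cv2.addWeighted(enhanced, 1.5, blurred, -0.5, 0)

# Usage with a dummy 16-bit thermal frame.
frame = (np.random.rand(240, 320) * 65535).astype(np.uint16)
swir_like = synthesize_swir_like(frame)
```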