🤖 AI Summary
To address modality misalignment, high computational overhead, and poor real-time deployability in RGB-T salient object detection (SOD) for unmanned aerial vehicle (UAV) platforms, this paper proposes a lightweight and efficient dual-modal detection framework. Methodologically: (i) we introduce a novel parameter-free semantic contrastive alignment loss to enable cross-modal collaboration at the semantic level; (ii) we design an FFT-inspired synchronized alignment fusion mechanism that jointly aligns and fuses features across both the channel and spatial dimensions; and (iii) we adopt a lightweight network architecture. Evaluated on eight benchmarks, including UAV RGB-T 2400, our method achieves state-of-the-art performance: compared with the SOTA model MROS, it reduces model parameters by 70.0%, decreases FLOPs by 49.4%, and increases inference speed by 152.5%, while delivering superior generalization capability and real-time efficiency.
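The "parameter-free" contrastive alignment described above can be pictured with a minimal sketch. The paper's exact formulation is not given here, so the following assumes a standard InfoNCE-style objective over globally pooled per-modality semantic vectors, with matched RGB/thermal pairs as positives and other in-batch pairs as negatives; the function name, pooling choice, and temperature are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def semantic_contrastive_alignment_loss(feat_rgb, feat_t, temperature=0.1):
    """Hypothetical parameter-free semantic contrastive alignment.

    feat_rgb, feat_t: (B, C) globally pooled semantic vectors, one per
    modality. Matched RGB/thermal rows are positives; all other rows in the
    batch serve as negatives. No learnable projection head is involved,
    which is what makes the loss parameter-free.
    """
    # L2-normalize so the dot product below is cosine similarity
    r = feat_rgb / np.linalg.norm(feat_rgb, axis=1, keepdims=True)
    t = feat_t / np.linalg.norm(feat_t, axis=1, keepdims=True)
    logits = r @ t.T / temperature                    # (B, B) similarities
    # numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matched pairs) as targets
    return -np.mean(np.diag(log_prob))
```

Because the loss only normalizes, compares, and averages existing features, aligning the two modalities this way adds no parameters or inference-time cost.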
📝 Abstract
Unmanned aerial vehicle (UAV)-based bi-modal salient object detection (BSOD) aims to segment salient objects in a scene by exploiting complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing UAV-based BSOD models limits their applicability to real-world UAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform, which obtains global relevance in linear complexity, we propose synchronized alignment fusion, which aligns and fuses bi-modal features in the channel and spatial dimensions through a hierarchical filtering mechanism. Our proposed model, AlignSal, reduces the number of parameters by 70.0%, decreases the floating-point operations by 49.4%, and increases the inference speed by 152.5% compared to the cutting-edge BSOD model MROS. Extensive experiments on UAV RGB-T 2400 and seven bi-modal dense prediction datasets demonstrate that AlignSal achieves real-time inference speed while surpassing nineteen state-of-the-art models in accuracy and generalizability across most evaluation metrics. In addition, our ablation studies further verify AlignSal's potential to boost the performance of existing aligned BSOD models on UAV-based unaligned data. The code is available at: https://github.com/JoshuaLPF/AlignSal.
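The key idea behind the Fourier-based fusion, operating on whole-feature-map spectra instead of pairwise attention so that global relevance costs O(HW log HW) rather than O((HW)^2), can be sketched as follows. This is not the paper's synchronized alignment fusion module; the magnitude-based gate and the function name are illustrative assumptions used only to show frequency-domain mixing of two modalities.

```python
import numpy as np

def fourier_fuse(feat_rgb, feat_t):
    """Illustrative FFT-based bi-modal fusion (not the paper's exact module).

    feat_rgb, feat_t: (C, H, W) feature maps. A 2-D FFT mixes every spatial
    position into every frequency bin, so one multiply in the spectrum acts
    globally at O(HW log HW) cost. Here each frequency is weighted toward
    the modality with more spectral energy before transforming back.
    """
    F_rgb = np.fft.fft2(feat_rgb, axes=(-2, -1))
    F_t = np.fft.fft2(feat_t, axes=(-2, -1))
    # frequency-wise gate: relative spectral energy of the RGB branch
    mag_r, mag_t = np.abs(F_rgb), np.abs(F_t)
    gate = mag_r / (mag_r + mag_t + 1e-8)
    fused = gate * F_rgb + (1.0 - gate) * F_t
    # back to the spatial domain; imaginary residue is numerical noise
    return np.fft.ifft2(fused, axes=(-2, -1)).real
```

A useful sanity check of such a gate is that fusing a feature map with itself returns the input unchanged, since the two weighted spectra sum back to the original spectrum.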