Efficient Fourier Filtering Network with Contrastive Learning for UAV-based Unaligned Bi-modal Salient Object Detection

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address modality misalignment, high computational overhead, and poor real-time deployability in RGB-T salient object detection (SOD) on unmanned aerial vehicle (UAV) platforms, this paper proposes a lightweight and efficient dual-modal detection framework. Methodologically, (i) it introduces a parameter-free semantic contrastive alignment loss that enables cross-modal collaboration at the semantic level; (ii) it designs an FFT-inspired synchronized alignment fusion mechanism that jointly aligns and fuses features across both channel and spatial dimensions; and (iii) it adopts a lightweight network architecture. Evaluated on eight benchmarks, including UAV RGB-T 2400, the method achieves state-of-the-art performance: compared with the previous best model, MROS, it reduces parameters by 70.0%, decreases FLOPs by 49.4%, and increases inference speed by 152.5%, while maintaining strong generalization and real-time efficiency.

📝 Abstract
Unmanned aerial vehicle (UAV)-based bi-modal salient object detection (BSOD) aims to segment salient objects in a scene utilizing complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing UAV-based BSOD models limits their applicability to real-world UAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform that obtains global relevance in linear complexity, we propose synchronized alignment fusion, which aligns and fuses bi-modal features in the channel and spatial dimensions by a hierarchical filtering mechanism. Our proposed model, AlignSal, reduces the number of parameters by 70.0%, decreases the floating point operations by 49.4%, and increases the inference speed by 152.5% compared to the cutting-edge BSOD model (i.e., MROS). Extensive experiments on the UAV RGB-T 2400 and seven bi-modal dense prediction datasets demonstrate that AlignSal achieves both real-time inference speed and better performance and generalizability compared to nineteen state-of-the-art models across most evaluation metrics. In addition, our ablation studies further verify AlignSal's potential in boosting the performance of existing aligned BSOD models on UAV-based unaligned data. The code is available at: https://github.com/JoshuaLPF/AlignSal.
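The abstract describes the semantic contrastive alignment loss only at a high level (parameter-free, aligning the two modalities at the semantic level). A plausible instantiation, not the paper's exact formulation, is a symmetric InfoNCE-style contrastive loss over globally pooled RGB and thermal features, where matching image pairs act as positives. The function name, pooling assumption, and temperature value below are illustrative assumptions:

```python
import numpy as np

def semantic_contrastive_loss(rgb_feats, thermal_feats, temperature=0.1):
    """Hypothetical sketch of a parameter-free cross-modal contrastive loss.

    rgb_feats, thermal_feats: (N, D) globally pooled feature vectors from
    the two modality branches; row i of each matrix comes from the same
    scene, so the diagonal of the similarity matrix holds the positives.
    """
    # L2-normalize so dot products become cosine similarities.
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    thm = thermal_feats / np.linalg.norm(thermal_feats, axis=1, keepdims=True)
    logits = rgb @ thm.T / temperature  # (N, N) scaled similarity matrix

    # Cross-entropy with the diagonal as targets, averaged over both
    # retrieval directions (RGB->thermal and thermal->RGB).
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_r2t = -np.mean(np.diag(log_p))
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2r = -np.mean(np.diag(log_p_t))
    return 0.5 * (loss_r2t + loss_t2r)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 16))
# Identical bi-modal features should score much lower than unrelated ones.
low = semantic_contrastive_loss(f, f)
high = semantic_contrastive_loss(f, rng.normal(size=(4, 16)))
print(low < high)
```

Because the loss has no learnable parameters, it can supervise both encoders without adding inference-time cost, which is consistent with the "parameter-free" claim in the abstract.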
Problem

Research questions and friction points this paper is trying to address.

Efficient bi-modal salient object detection for UAVs
Reducing computational cost in RGB-thermal image processing
Aligning unaligned RGB and thermal image pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic contrastive alignment loss for modality alignment
Synchronized alignment fusion with Fourier filtering
Hierarchical filtering mechanism for efficient feature fusion
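The Fourier filtering idea above exploits the fact that a 2-D FFT gives every output element a global spatial receptive field at O(HW log HW) cost, avoiding the quadratic cost of attention. The sketch below is a minimal, assumed illustration of frequency-domain bi-modal fusion, not AlignSal's actual module: the magnitude-agreement gate is an invented illustrative choice:

```python
import numpy as np

def fourier_filter_fusion(rgb, thermal):
    """Illustrative frequency-domain fusion of two modality feature maps.

    rgb, thermal: (C, H, W) feature maps. Filtering in the Fourier domain
    mixes information across the whole spatial map in linear-log time,
    which is the efficiency argument behind FFT-based fusion.
    """
    R = np.fft.rfft2(rgb)      # per-channel 2-D FFT over (H, W)
    T = np.fft.rfft2(thermal)

    # Data-dependent spectral gate (hypothetical): emphasize frequencies
    # where the two modalities agree in magnitude, damp the rest.
    gate = np.abs(R) * np.abs(T)
    gate = gate / (gate.max() + 1e-8)

    # Filter the summed spectrum and return to the spatial domain.
    fused = np.fft.irfft2(gate * (R + T), s=rgb.shape[-2:])
    return fused

x = np.random.default_rng(1).normal(size=(8, 16, 16))
y = np.random.default_rng(2).normal(size=(8, 16, 16))
out = fourier_filter_fusion(x, y)
print(out.shape)  # (8, 16, 16)
```

A real implementation would apply such filtering hierarchically across encoder stages and along both channel and spatial dimensions, as the abstract describes.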
Pengfei Lyu
Ph.D student at Northeastern University
Machine Learning · Computer Vision · Multi-modal Image Processing
P. Yeung
College of Computing and Data Science, Nanyang Technological University, 639798 Singapore
Xiufei Cheng
Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169 China
Xiaosheng Yu
Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169 China
Chengdong Wu
Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169 China
Jagath C. Rajapakse
College of Computing and Data Science, Nanyang Technological University, 639798 Singapore