🤖 AI Summary
Existing RGB-T salient object detection (SOD) methods face two key bottlenecks: (1) Transformer-based architectures have quadratic complexity, and the resulting compute and memory cost hinders high-resolution bi-modal feature fusion; and (2) a frequency gap remains between predictions and ground truth even after the model has converged. To address both, the authors propose DFENet (Deep Fourier-Embedded Network), a purely Fourier transform-based network that relies on the fast Fourier transform (FFT) rather than self-attention for cross-modal fusion. It has four core components: Modal-coordinated Perception Attention, which fuses the RGB and thermal modalities; the Frequency-decomposed Edge-aware Block, which sharpens object edges by decomposing and enhancing the frequency components of low-level features; the Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships; and the Co-focus Frequency Loss, which dynamically weights hard frequencies by cross-referencing bi-modal edge information in the Fourier domain. On four RGB-T SOD benchmark datasets, DFENet outperforms fifteen existing state-of-the-art models, and ablation studies support the contribution of each component. The code is publicly available.
📝 Abstract
The rapid development of deep learning has significantly improved salient object detection (SOD) that combines RGB and thermal (RGB-T) images. However, existing deep learning-based RGB-T SOD models suffer from two major limitations. First, Transformer-based models with quadratic complexity are computationally expensive and memory-intensive, limiting their application in high-resolution bi-modal feature fusion. Second, even when these models converge to an optimal solution, there remains a frequency gap between the prediction and the ground truth. To overcome these limitations, we propose a purely Fourier transform-based model, namely Deep Fourier-Embedded Network (DFENet), for accurate RGB-T SOD. To reduce the computational cost of processing high-resolution images, we leverage the efficiency of the fast Fourier transform, which has log-linear (O(N log N)) complexity, to design three key components: (1) the Modal-coordinated Perception Attention, which fuses RGB and thermal modalities with enhanced multi-dimensional representation; (2) the Frequency-decomposed Edge-aware Block, which clarifies object edges by deeply decomposing and enhancing frequency components of low-level features; and (3) the Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. To mitigate the frequency gap, we propose the Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bi-modal edge information in the Fourier domain. Extensive experiments on four RGB-T SOD benchmark datasets demonstrate that DFENet outperforms fifteen existing state-of-the-art RGB-T SOD models. Comprehensive ablation studies further validate the effectiveness of the proposed components. The code is available at https://github.com/JoshuaLPF/DFENet.
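The abstract builds on two generic ideas: splitting a feature map into low- and high-frequency parts with the FFT, and weighting "hard" frequencies more heavily in a loss. The NumPy sketch below illustrates both in their simplest form. It is not the DFENet implementation: the function names, the circular mask, the `radius_ratio` and `alpha` parameters, and the single-modality focal-style weighting are all assumptions made for illustration. In particular, the loss shown here does not cross-reference RGB and thermal edge information as the Co-focus Frequency Loss is described to do.

```python
import numpy as np


def split_frequencies(feature, radius_ratio=0.25):
    """Split a 2-D map into low- and high-frequency parts.

    Hypothetical helper: uses a circular mask around the zero frequency
    of the shifted spectrum. `radius_ratio` is an illustrative parameter.
    """
    h, w = feature.shape
    # Move the zero-frequency bin to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = dist <= radius_ratio * min(h, w)
    # Complementary masks, so low + high reconstructs the input.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


def focal_style_frequency_loss(pred, target, alpha=1.0):
    """Frequency-domain loss that up-weights poorly reconstructed bins.

    Illustrative only: each frequency's squared error is scaled by a
    weight proportional to its own error magnitude raised to `alpha`.
    """
    diff = np.fft.fft2(pred, norm="ortho") - np.fft.fft2(target, norm="ortho")
    sq_err = np.abs(diff) ** 2
    weight = np.abs(diff) ** alpha
    peak = weight.max()
    if peak > 0:
        weight = weight / peak  # normalize weights to [0, 1]
    return float(np.mean(weight * sq_err))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature = rng.standard_normal((32, 32))
    low, high = split_frequencies(feature)
    print("reconstruction ok:", np.allclose(low + high, feature))

    target = (rng.random((32, 32)) > 0.5).astype(float)  # toy binary mask
    pred = np.clip(target + 0.1 * rng.standard_normal((32, 32)), 0.0, 1.0)
    print("loss:", focal_style_frequency_loss(pred, target))
```

Because the two masks are complementary, the low and high parts always sum back to the input, which makes this kind of decomposition easy to check. Both functions cost O(N log N) in the number of pixels, which is the efficiency argument the abstract makes against quadratic self-attention.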