From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image deblurring faces challenges including complex motion blur, difficulty recovering fine details at high resolutions, and high computational overhead. This paper proposes a spatial-frequency dual-domain collaborative deblurring network. It integrates Vision Transformers (ViTs) with a learnable Fourier-domain FFT-ReLU module: ViTs model long-range spatial dependencies to capture global blur patterns, while the FFT-ReLU module introduces sparse, learnable nonlinearity in the frequency domain, explicitly regularizing frequency responses to suppress blur artifacts and preserve high-frequency details. This design directly links spatial attention mechanisms with frequency-domain sparsity. The method achieves state-of-the-art performance on multiple benchmarks (e.g., GoPro, HIDE), with significant PSNR/SSIM improvements. Comprehensive evaluations, including quantitative metrics, qualitative analysis, and human perceptual assessment, demonstrate its superior visual quality and perceptual fidelity.

📝 Abstract
Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.
Problem

Research questions and friction points this paper is trying to address.

Recovering sharp images from blurry inputs caused by motion or camera shake
Addressing limitations of CNNs and Vision Transformers with complex blur
Bridging spatial attention modeling and frequency sparsity for deblurring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer captures local and global dependencies
FFT-ReLU enforces frequency-domain sparsity to suppress blur artifacts
Dual-domain architecture bridges spatial attention and frequency processing
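The core FFT-ReLU operation described above can be sketched minimally: transform features to the frequency domain, apply a ReLU there to induce sparsity, and transform back. The paper's module is learnable and embedded in a larger network; where exactly the nonlinearity is applied (here, separately to the real and imaginary parts) and the absence of learnable scaling are assumptions of this NumPy sketch, which shows only the frequency-sparsification step.

```python
import numpy as np

def fft_relu(x: np.ndarray) -> np.ndarray:
    """Apply a ReLU nonlinearity in the 2D Fourier domain.

    Assumed simplification: ReLU is applied elementwise to the real and
    imaginary parts of the spectrum, zeroing negative components and so
    sparsifying the frequency response before transforming back.
    """
    X = np.fft.fft2(x)                                  # spatial -> frequency
    X_sparse = np.maximum(X.real, 0) + 1j * np.maximum(X.imag, 0)
    return np.fft.ifft2(X_sparse).real                  # frequency -> spatial
```

In a full model this block would sit alongside the ViT branch, with learnable weights modulating the spectrum before the nonlinearity; the sketch keeps only the fixed-function core.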