🤖 AI Summary
In high-resolution image restoration, simultaneously modeling global contextual dependencies and preserving local details remains challenging, and conventional self-attention incurs prohibitive computational cost at high resolutions. To address this, the authors adapt Dilated Neighborhood Attention (DiNA), originally developed for high-level vision tasks, which expands the receptive field through multi-scale dilated sliding windows while keeping attention local and cheap. Because purely local attention can miss the global context needed for accurate restoration, they complement it with a channel-aware module that aggregates global information across channels. The resulting lightweight Transformer, DiNAT-IR, synergistically combines channel-wise attention, dilated sliding-window attention, and a hybrid dilation strategy for adaptive multi-scale feature fusion. Evaluated on multiple image deblurring and restoration benchmarks, DiNAT-IR achieves state-of-the-art or near-state-of-the-art performance with substantially fewer parameters and lower FLOPs, unifying high-fidelity reconstruction with efficient inference.
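The channel-wise attention mentioned above (popularized by Restormer) can be illustrated with a minimal sketch: attention is computed between channels rather than spatial positions, so its cost scales with the square of the channel count instead of the square of the number of pixels. This is a simplified single-head version with the learned projections omitted; the normalization-then-channel-softmax layout follows Restormer's "transposed" attention, but all details here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def channel_attention(x, temperature=1.0):
    """Sketch of channel-wise ("transposed") self-attention.

    x: (N, C) array of N flattened spatial positions x C channels.
    The attention map is (C, C), so cost is O(N * C^2) rather than
    the O(N^2 * C) of spatial self-attention.
    """
    q = k = v = x  # learned Q/K/V projections omitted in this sketch
    # L2-normalize each channel over the spatial axis (as in Restormer)
    qn = q / np.linalg.norm(q, axis=0, keepdims=True)
    kn = k / np.linalg.norm(k, axis=0, keepdims=True)
    attn = (qn.T @ kn) / temperature              # (C, C) channel affinities
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over channels
    return v @ attn.T                             # (N, C) re-mixed channels
```

Because every output channel is a mixture of all input channels weighted over the entire spatial extent, this operation is global by construction, which is exactly why it can miss the localized artifacts the abstract discusses.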
📝 Abstract
Transformers, with their self-attention mechanisms for modeling long-range dependencies, have become a dominant paradigm in image restoration tasks. However, the high computational cost of self-attention limits scalability to high-resolution images, making efficiency-quality trade-offs a key research focus. To address this, Restormer employs channel-wise self-attention, which computes attention across channels instead of spatial dimensions. While effective, this approach may overlook localized artifacts that are crucial for high-quality image restoration. To bridge this gap, we explore Dilated Neighborhood Attention (DiNA) as a promising alternative, inspired by its success in high-level vision tasks. DiNA balances global context and local precision by integrating sliding-window attention with mixed dilation factors, effectively expanding the receptive field without excessive overhead. However, our preliminary experiments indicate that directly applying this global-local design to the classic deblurring task hinders accurate visual restoration, primarily due to the constrained global context understanding within local attention. To address this, we introduce a channel-aware module that complements local attention, effectively integrating global context without sacrificing pixel-level precision. The proposed DiNAT-IR, a Transformer-based architecture specifically designed for image restoration, achieves competitive results across multiple benchmarks, offering a high-quality solution for diverse low-level computer vision problems.
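To make the dilated sliding-window idea concrete, here is a minimal 1D, single-head sketch of neighborhood attention with a dilation factor: each query attends only to a fixed-size window of neighbors spaced `dilation` positions apart, so the receptive field grows with the dilation at no extra compute. The function name, border handling (index clipping), and 1D simplification are illustrative assumptions; DiNA as used in DiNAT-IR operates on 2D feature maps with multiple heads and mixed dilations per block.

```python
import numpy as np

def dilated_neighborhood_attention_1d(x, w_q, w_k, w_v,
                                      kernel_size=3, dilation=1):
    """Sketch of 1D dilated neighborhood attention.

    x: (L, C) token features; w_q/w_k/w_v: (C, C) projection matrices.
    Each query i attends to kernel_size neighbors at positions
    i + dilation * {-r, ..., r}, clipped to the valid range.
    """
    L, C = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    r = kernel_size // 2
    out = np.zeros_like(v)
    for i in range(L):
        # dilated neighborhood, clipped at the borders (an assumption;
        # real implementations shift the window instead)
        idx = np.clip(i + dilation * np.arange(-r, r + 1), 0, L - 1)
        scores = q[i] @ k[idx].T / np.sqrt(C)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over the window
        out[i] = weights @ v[idx]
    return out
```

With `dilation=1` this reduces to plain neighborhood (sliding-window) attention; larger dilations widen the receptive field without enlarging the window, which is the trade-off the abstract describes. The per-position loop keeps the sketch readable; production code vectorizes it.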