🤖 AI Summary
Convolutional models struggle to capture long-range dependencies, and attention mechanisms are computationally expensive on high-resolution images. To address both limitations in image restoration, this paper proposes Serpent, an architecture built on state space models (SSMs). Serpent converts the input image into a collection of sequences and processes them in a multi-scale, hierarchical fashion, which gives it a global receptive field with computational cost that scales linearly in input size. Experiments show that the method matches state-of-the-art (SOTA) reconstruction quality while reducing FLOPs by up to 150× and GPU memory consumption by up to 5×, with a compact model size. The efficiency gains are most pronounced on high-resolution inputs.
📝 Abstract
The landscape of computational building blocks for efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle to model long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a cost that is quadratic in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to a $150$-fold reduction in FLOPs) and up to $5\times$ less GPU memory, while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
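To make the idea of serializing an image and scanning it at several scales concrete, the following is a minimal NumPy sketch. It is not Serpent's actual block. The row-major serialization, the fixed diagonal recurrence with hand-picked parameters `a`, `b`, `c`, the 2×2 average-pooling downsampler, and all function names are assumptions made for illustration only.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Diagonal linear state space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = <c, h_t>
    A single pass over the sequence, so the cost is linear in its length.
    """
    h = np.zeros_like(a, dtype=float)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t      # state update carries information forward
        y[t] = np.dot(c, h)      # read out the current state
    return y

def downsample2(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = img.shape
    h2, w2 = h // 2, w // 2
    return img[:2 * h2, :2 * w2].reshape(h2, 2, w2, 2).mean(axis=(1, 3))

def multiscale_scan(img, a, b, c, levels=3):
    """Serialize the image at several scales and scan each sequence."""
    outputs = []
    for _ in range(levels):
        seq = img.reshape(-1)    # row-major serialization (an assumption)
        outputs.append(ssm_scan(seq, a, b, c).reshape(img.shape))
        img = downsample2(img)
    return outputs

# Hypothetical parameters: two state channels with different decay rates.
rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
a = np.array([0.9, 0.5])
b = np.array([1.0, 1.0])
c = np.array([0.5, 0.5])
scales = multiscale_scan(image, a, b, c)
print([s.shape for s in scales])  # [(16, 16), (8, 8), (4, 4)]
```

Each output position depends on every earlier position in the sequence, which is the sense in which a recurrence of this kind has a global receptive field, while the work per position stays constant. Pairwise attention over the same $L$ positions would instead compare $L^2$ pairs.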