🤖 AI Summary
In real-world image super-resolution, diffusion models recover structural detail poorly because they sample noise from a single, fixed forward distribution regardless of image content. To address this, we propose a structure-aware diffusion super-resolution method with zero inference overhead. Our core innovation is the integration of fine-grained semantic structural priors—extracted from the Segment Anything Model (SAM)—as implicit guidance that modulates the mean of the forward diffusion noise in a region-adaptive manner. During training, these structural cues steer denoising; crucially, at inference time the model runs entirely without SAM, incurring no additional computational cost. Evaluated on DIV2K, our method achieves up to a +0.74 dB PSNR gain over existing diffusion-based SR methods, suppresses common artifacts (e.g., blurring and checkerboard patterns), and delivers strong fidelity and perceptual quality.
📝 Abstract
Diffusion-based super-resolution (SR) models have recently garnered significant attention due to their potent restoration capabilities. However, conventional diffusion models sample noise from a single distribution, constraining their ability to handle real-world scenes and complex textures that vary across semantic regions. With the success of the Segment Anything Model (SAM), sufficiently fine-grained region masks can enhance the detail recovery of diffusion-based SR models. However, directly integrating SAM into SR models results in much higher computational cost. In this paper, we propose the SAM-DiffSR model, which utilizes the fine-grained structure information from SAM during noise sampling to improve image quality without additional computational cost at inference. During training, we encode structural position information into the segmentation mask from SAM. The encoded mask is then integrated into the forward diffusion process by modulating the sampled noise with it. This adjustment allows us to independently adapt the noise mean within each corresponding segmentation region, and the diffusion model is trained to estimate this modulated noise. Crucially, our proposed framework does NOT change the reverse diffusion process and does NOT require SAM at inference. Experimental results demonstrate the effectiveness of our proposed method, showcasing superior performance in suppressing artifacts and surpassing existing diffusion-based methods by up to 0.74 dB in PSNR on the DIV2K dataset. The code and dataset are available at https://github.com/lose4578/SAM-DiffSR.
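The region-adaptive modulation described above can be sketched in a few lines: shift the mean of the sampled Gaussian noise independently inside each segmentation region before the usual forward-diffusion mixing step. This is a minimal NumPy sketch under stated assumptions; the function and variable names (`modulate_noise_with_mask`, `region_means`) are illustrative and not taken from the paper's code, and the per-region offsets stand in for the learned mask encoding.

```python
import numpy as np

def modulate_noise_with_mask(noise, seg_mask, region_means):
    """Shift the noise mean independently per segmentation region.

    noise        : sampled Gaussian noise, shape (H, W)
    seg_mask     : integer region IDs per pixel, shape (H, W)
    region_means : dict mapping region ID -> mean offset (stand-in
                   for the encoded SAM mask; illustrative only)
    """
    modulated = noise.copy()
    for region_id, mean_shift in region_means.items():
        modulated[seg_mask == region_id] += mean_shift
    return modulated

# Toy example: 4x4 noise field with two segmentation regions.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 4))
seg_mask = np.zeros((4, 4), dtype=int)
seg_mask[:, 2:] = 1                # right half belongs to region 1
region_means = {0: 0.0, 1: 0.5}    # per-region mean offsets

x = modulate_noise_with_mask(noise, seg_mask, region_means)
```

The modulated noise `x` would then replace the plain Gaussian sample in the standard forward step (e.g. `x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * x`), and the denoiser is trained to predict this shifted noise; the reverse process is unchanged, which is why SAM is not needed at inference.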