🤖 AI Summary
Real-world image super-resolution (Real-ISR) must preserve structural fidelity while generating realistic details, but existing generative methods often rely on entangled conditioning mechanisms that cause structural distortions or semantic inconsistencies. This work, Visual In-Context Restoration (VICR), reframes the task as image completion and introduces a decoupled local and global visual prior injection mechanism within a Diffusion Transformer (DiT) architecture. Local cues derived from the low-quality input help restore structure and support fine detail synthesis, while global cues guide generation toward semantic consistency. An inference-time agent additionally refines semantic prompts based on visual evidence from the input, without updating model parameters. With only 127 million trainable parameters, the method reports state-of-the-art performance across multiple Real-ISR benchmarks.
📝 Abstract
Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.
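The abstract does not specify how the local and global cues are computed or injected, so the following is only a toy sketch of what an "image completion" layout with decoupled cues could look like. It assumes, purely for illustration, that local cues are patch tokens of the LQ image, that the global cue is a single summary token (here just the mean intensity), and that both are placed in front of the noisy target tokens the model would have to fill in. The names `patchify`, `global_cue`, `build_sequence`, and `InContextSequence` are hypothetical and not taken from VICR.

```python
from dataclasses import dataclass
from typing import List

Token = List[float]
Image = List[List[float]]  # H x W grayscale image as a list of rows


@dataclass
class InContextSequence:
    tokens: List[Token]    # context tokens followed by target tokens
    is_target: List[bool]  # True where content must be generated


def patchify(image: Image, patch: int) -> List[Token]:
    """Split an image into non-overlapping patch x patch tokens (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def global_cue(image: Image, dim: int) -> Token:
    """Toy global descriptor (assumption): the mean intensity, repeated dim times."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [mean] * dim


def build_sequence(lq_image: Image, noisy_target: Image, patch: int) -> InContextSequence:
    """Lay out [global cue | local LQ tokens | noisy target tokens] as one sequence."""
    local_tokens = patchify(lq_image, patch)       # fine-grained, structure-level cues
    target_tokens = patchify(noisy_target, patch)  # positions to be completed
    global_token = global_cue(lq_image, patch * patch)  # image-level cue
    tokens = [global_token] + local_tokens + target_tokens
    is_target = [False] * (1 + len(local_tokens)) + [True] * len(target_tokens)
    return InContextSequence(tokens, is_target)


# Usage: a 4x4 LQ image and a 4x4 noisy target, both split into 2x2 patches.
lq = [[0.0, 1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0, 7.0],
      [8.0, 9.0, 10.0, 11.0],
      [12.0, 13.0, 14.0, 15.0]]
noisy = [[0.5] * 4 for _ in range(4)]
seq = build_sequence(lq, noisy, patch=2)
```

In this layout the context positions stay fixed and only the positions flagged in `is_target` would be denoised, which is one way to read "Real-ISR as image completion". A real system would use learned encoders for both cues and a target at higher resolution than the LQ input.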