VICR: Visual In-Context Restoration for Real-World Image Super-Resolution

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Real-world image super-resolution must preserve structural fidelity while generating realistic details, yet existing generative methods often rely on entangled conditioning mechanisms that cause structural drift or semantically inconsistent details. This work reframes the task as image completion and introduces a decoupled local/global visual prior injection mechanism within a Diffusion Transformer (DiT) architecture. Local cues derived from the low-quality input help recover structure and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, an inference-time agent refines the semantic prompts using visual evidence from the input, without updating model parameters. With only 127M trainable parameters, the method achieves state-of-the-art performance across multiple real-world super-resolution benchmarks.

📝 Abstract

Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.
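The abstract does not say how the local and global cues are computed or how the completion input is laid out, so the following is only an illustrative sketch of the general idea, not VICR's implementation. Every function name here (`local_cues`, `global_cue`, `completion_canvas`) is hypothetical, and three things are assumptions made purely for illustration: the patch size, the use of a per-channel mean as the global cue, and the side-by-side canvas layout.

```python
import numpy as np


def local_cues(lq: np.ndarray, patch: int = 8) -> np.ndarray:
    """Hypothetical local cue: one flattened token per non-overlapping patch.

    lq is an H x W x C array; trailing rows/columns that do not fill a
    whole patch are cropped.
    """
    h, w, c = lq.shape
    gh, gw = h // patch, w // patch
    x = lq[: gh * patch, : gw * patch]
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    x = x.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def global_cue(lq: np.ndarray) -> np.ndarray:
    """Hypothetical global cue: a single image-level vector (per-channel mean)."""
    return lq.mean(axis=(0, 1))


def completion_canvas(lq: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Assumed completion layout: LQ image on the left, empty region on the right.

    Returns the canvas and a boolean mask that is True where a generative
    model would be asked to fill in content.
    """
    h, w, c = lq.shape
    canvas = np.zeros((h, 2 * w, c), dtype=lq.dtype)
    canvas[:, :w] = lq
    mask = np.zeros((h, 2 * w), dtype=bool)
    mask[:, w:] = True
    return canvas, mask


if __name__ == "__main__":
    lq = np.random.default_rng(0).random((32, 48, 3))
    tokens = local_cues(lq)          # fine-grained, spatially indexed cues
    summary = global_cue(lq)         # coarse, image-level cue
    canvas, mask = completion_canvas(lq)
    print(tokens.shape, summary.shape, canvas.shape, int(mask.sum()))
```

The point of the sketch is only the separation: the local tokens keep spatial position and fine detail, while the global vector discards position entirely, so the two can be injected into a model through different pathways rather than through a single entangled condition.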

Problem

Research questions and friction points this paper is trying to address.

Real-world image super-resolution

structural fidelity

semantic consistency

detail synthesis

degraded observations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual In-Context Restoration

Decoupled Visual Prior

Diffusion Transformer