🤖 AI Summary
Existing vector quantization (VQ) methods for generative super-resolution suffer from two key limitations: (1) coarse-grained quantization of all visual features, leading to significant texture degradation and high quantization distortion; and (2) index predictors trained solely with codebook-level supervision, neglecting end-to-end reconstruction fidelity. To address these, we propose Texture Vector Quantization (TVQ) and Reconstruction-Aware Prediction (RAP). TVQ selectively models only high-frequency missing textures—bypassing low-frequency content—thereby substantially reducing quantization artifacts. RAP incorporates a straight-through estimator to enable end-to-end, image-level supervision, directly optimizing index prediction for reconstruction quality. Our approach achieves marked improvements in texture realism and structural consistency of super-resolved images with minimal computational overhead. Extensive experiments demonstrate state-of-the-art performance over prevailing VQ-based methods across multiple benchmarks, validating the synergistic enhancement of prior modeling accuracy and final reconstruction quality.
📝 Abstract
Vector-quantization based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with the nearest codebook items and train the index predictor with code-level supervision. Due to the richness of visual signals, such VQ encoding often leads to large quantization errors. Furthermore, training the predictor with code-level supervision does not take the final reconstruction error into consideration, resulting in sub-optimal prior modeling accuracy. In this paper, we address the above two issues and propose a Texture Vector-Quantization strategy and a Reconstruction-Aware Prediction strategy. The texture vector-quantization strategy leverages the task characteristics of super-resolution and introduces a codebook only to model the prior of the missing textures, while the reconstruction-aware prediction strategy makes use of the straight-through estimator to directly train the index predictor with image-level supervision. Our proposed generative SR model (TVQ&RAP) delivers photo-realistic SR results at a small computational cost.
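As an illustrative sketch (not the paper's implementation), the two mechanisms the abstract refers to can be written compactly: nearest-codebook encoding is an argmin over codebook distances, and the straight-through estimator (STE) passes gradients through that non-differentiable argmin so an image-level loss can reach the index predictor. The function and variable names below are hypothetical.

```python
import numpy as np

def vq_encode(z, codebook):
    """Nearest-codebook quantization: replace each feature vector in z
    (shape (N, D)) with its closest entry in codebook (shape (K, D))."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

# The straight-through estimator cannot be shown with plain NumPy, since it
# is a gradient rule: the forward pass uses the quantized features, while
# the backward pass treats quantization as the identity, i.e. in an
# autograd framework one would write
#     z_q = z + stop_gradient(vq_encode(z, codebook)[0] - z)
# so that an image-level reconstruction loss can supervise the index
# predictor end to end, as the abstract describes.

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))  # 16 codebook entries, 4-dim features
z = rng.standard_normal((8, 4))          # 8 feature vectors to quantize
z_q, idx = vq_encode(z, codebook)        # z_q rows are exact codebook entries
```

The quantization error the abstract mentions is simply `((z - z_q) ** 2).sum()`; restricting the codebook to high-frequency texture residuals (as TVQ does) shrinks the signal the codebook must cover, and hence this error.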