🤖 AI Summary
High-resolution image synthesis faces a fundamental trade-off between computational efficiency and fine-grained detail fidelity. This paper proposes a lightweight, zero-overhead frequency-aware framework that enables any latent diffusion model to synthesize 2K–4K ultra-high-resolution images without architectural modification or additional parameters. The method operates entirely in the latent space and introduces two key innovations: (1) a wavelet energy map that localizes high-frequency, detail-rich regions; and (2) a scale-consistent VAE reconstruction objective jointly optimized with time-varying high-frequency masking supervision, which together enhance spectral fidelity in the latent space. On 4K generation tasks, the approach consistently improves perceptual quality and lowers FID, outperforming strong baselines. Crucially, it is fully compatible with existing diffusion models and requires no retraining or fine-tuning of the diffusion backbone, enabling plug-and-play deployment.
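The wavelet energy map described above can be sketched with a single-level 2D Haar transform: the energy of the three high-frequency subbands (horizontal, vertical, diagonal details) marks detail-rich regions. This is an illustrative sketch, not the paper's implementation; the Haar basis, single decomposition level, and max-normalization are assumptions for clarity.

```python
import numpy as np

def haar_dwt2(x):
    # Single-level 2D Haar transform of an (H, W) array with even H, W.
    # Returns the LL (low-pass) and LH/HL/HH (high-frequency) subbands.
    a = (x[0::2, :] + x[1::2, :]) / 2.0  # row-wise average
    d = (x[0::2, :] - x[1::2, :]) / 2.0  # row-wise difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def wavelet_energy_map(x):
    # Sum of squared high-frequency coefficients, normalized to [0, 1].
    # High values indicate texture/edge regions; flat regions are near 0.
    _, lh, hl, hh = haar_dwt2(x)
    energy = lh ** 2 + hl ** 2 + hh ** 2
    return energy / (energy.max() + 1e-8)

# Example: a half-flat, half-textured image yields high energy only
# on the textured half.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:, 32:] = rng.standard_normal((64, 32))
emap = wavelet_energy_map(img)
```

In the paper this map is computed over latent representations rather than pixels, and serves to steer supervision toward high-frequency regions.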
📝 Abstract
High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.
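The time-dependent masking strategy in component (3) can be sketched as a reweighted denoising objective: the per-element prediction error is upweighted in regions flagged by the wavelet energy map, with a strength that varies over the diffusion timestep. The linear schedule and the `alpha` gain below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_denoising_loss(eps_pred, eps_true, energy_map, t, T, alpha=1.0):
    # Squared-error denoising loss, reweighted toward high-frequency regions.
    # strength grows as t -> 0, i.e. late denoising steps (which refine fine
    # detail) receive stronger high-frequency supervision. The linear
    # schedule in (1 - t/T) is an assumed, illustrative choice.
    strength = alpha * (1.0 - t / T)
    weights = 1.0 + strength * energy_map  # weight >= 1 everywhere
    se = (eps_pred - eps_true) ** 2
    return (weights * se).mean()

# Example: with a zero energy map the loss reduces to plain MSE; a
# nonzero map at a late timestep increases the penalty.
rng = np.random.default_rng(1)
pred = rng.standard_normal((8, 8))
true = np.zeros((8, 8))
plain = masked_denoising_loss(pred, true, np.zeros((8, 8)), t=500, T=1000)
masked = masked_denoising_loss(pred, true, np.ones((8, 8)), t=0, T=1000)
```

Because the reweighting touches only the training loss, it adds no parameters and no inference-time cost, consistent with the "no additional computational overhead" claim.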