🤖 AI Summary
High-resolution image synthesis faces a fundamental trade-off between computational efficiency and fine-grained detail fidelity. This paper proposes a lightweight, zero-overhead frequency-aware framework that enables any latent diffusion model to synthesize 2K–4K ultra-high-resolution images without architectural modification or additional parameters. The method operates entirely in the latent space and introduces two key innovations: (1) a wavelet energy map that localizes high-frequency, detail-rich regions; and (2) a scale-consistent VAE reconstruction objective jointly optimized with time-varying high-frequency masking supervision, which together enhance spectral fidelity in the latent space. On 4K generation tasks, the approach consistently improves perceptual quality and lowers FID, outperforming strong baselines. Crucially, it is fully compatible with existing diffusion models and requires no retraining or fine-tuning of the diffusion backbone, enabling plug-and-play deployment.
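The wavelet energy map described above can be sketched with a single-level 2D Haar transform: the energy of the three high-frequency subbands (horizontal, vertical, diagonal details) marks detail-rich regions. This is an illustrative sketch, not the paper's implementation; the Haar basis, single decomposition level, and max-normalization are assumptions for clarity.

```python
import numpy as np

def haar_dwt2(x):
    # Single-level 2D Haar transform of an (H, W) array with even H, W.
    # Returns the LL (low-pass) and LH/HL/HH (high-frequency) subbands.
    a = (x[0::2, :] + x[1::2, :]) / 2.0  # row-wise average
    d = (x[0::2, :] - x[1::2, :]) / 2.0  # row-wise difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def wavelet_energy_map(x):
    # Sum of squared high-frequency coefficients, normalized to [0, 1].
    # High values indicate texture/edge regions; flat regions are near 0.
    _, lh, hl, hh = haar_dwt2(x)
    energy = lh ** 2 + hl ** 2 + hh ** 2
    return energy / (energy.max() + 1e-8)

# Example: a half-flat, half-textured image yields high energy only
# on the textured half.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:, 32:] = rng.standard_normal((64, 32))
emap = wavelet_energy_map(img)
```

In the paper this map is computed over latent representations rather than pixels, and serves to steer supervision toward high-frequency regions.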
📝 Abstract
High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.
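The time-dependent masking strategy in component (3) can be sketched as a reweighted denoising objective: the per-element prediction error is upweighted in regions flagged by the wavelet energy map, with a strength that varies over the diffusion timestep. The linear schedule and the `alpha` gain below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_denoising_loss(eps_pred, eps_true, energy_map, t, T, alpha=1.0):
    # Squared-error denoising loss, reweighted toward high-frequency regions.
    # strength grows as t -> 0, i.e. late denoising steps (which refine fine
    # detail) receive stronger high-frequency supervision. The linear
    # schedule in (1 - t/T) is an assumed, illustrative choice.
    strength = alpha * (1.0 - t / T)
    weights = 1.0 + strength * energy_map  # weight >= 1 everywhere
    se = (eps_pred - eps_true) ** 2
    return (weights * se).mean()

# Example: with a zero energy map the loss reduces to plain MSE; a
# nonzero map at a late timestep increases the penalty.
rng = np.random.default_rng(1)
pred = rng.standard_normal((8, 8))
true = np.zeros((8, 8))
plain = masked_denoising_loss(pred, true, np.zeros((8, 8)), t=500, T=1000)
masked = masked_denoising_loss(pred, true, np.ones((8, 8)), t=0, T=1000)
```

Because the reweighting touches only the training loss, it adds no parameters and no inference-time cost, consistent with the "no additional computational overhead" claim.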