🤖 AI Summary
This work addresses the challenge of simultaneously preserving the structure of hand-drawn sketches and generating photorealistic images. We propose a fine-tuning-free sketch-to-image synthesis method that leverages pre-trained diffusion models. Our core innovation is a lightweight, learnable linear mapping network that aligns sketch features with the image latent space of a frozen latent diffusion model (LDM), augmented by CLIP embeddings as spatial guidance. This design eliminates the need for model retraining or domain-specific training data. Quantitatively, our approach achieves state-of-the-art performance on multiple metrics, including FID, LPIPS, and Sketch-CLIP, and delivers superior high-resolution visual quality. It outperforms both GAN-based and fine-tuned diffusion-based methods, marking, to our knowledge, the first zero-shot, lightweight, and high-fidelity sketch-driven image generation framework.
📝 Abstract
Structural guidance in image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality, realistic images from user-specified rough hand-drawn sketches is one such task, aiming to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases in content creation and academic research, the problem is fundamentally challenging due to the substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation adds further complexity. Existing approaches based on Generative Adversarial Networks (GANs) generally rely on conditional GANs or GAN inversion, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) has brought a generational leap in general image synthesis, particularly in capturing low-level visual attributes. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often impractical due to prohibitive computational cost and insufficient data. In this paper, we introduce a technique for sketch-to-image translation that exploits the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable, lightweight mapping network to translate latent features from the source domain to the target domain. Experimental results demonstrate that the proposed method outperforms existing techniques on qualitative and quantitative benchmarks, enabling high-resolution, realistic image synthesis from rough hand-drawn sketches.
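The central component described above, a lightweight learnable mapping from sketch-feature space into the frozen diffusion model's latent space, can be sketched as follows. This is an illustrative NumPy toy, not the paper's implementation: the feature dimensions `D_SKETCH` and `D_LATENT`, the random stand-in data, and the plain gradient-descent loop are all assumptions for demonstration, and the frozen LDM and CLIP guidance are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensionalities (illustrative only, not specified by the paper).
D_SKETCH, D_LATENT, N = 512, 256, 64

# Lightweight learnable linear mapping: z = W s + b.
W = rng.normal(scale=0.01, size=(D_LATENT, D_SKETCH))
b = np.zeros(D_LATENT)

def map_features(s):
    """Project sketch features into the (frozen) LDM latent space."""
    return s @ W.T + b

# Stand-in data: sketch-encoder features and target image latents.
S = rng.normal(size=(N, D_SKETCH))
Z = rng.normal(size=(N, D_LATENT))

initial_loss = float(np.mean((map_features(S) - Z) ** 2))

# Only W and b are trained; everything else stays frozen, which is what
# makes the approach fine-tuning-free with respect to the diffusion model.
lr = 1e-3
for _ in range(200):
    err = map_features(S) - Z          # (N, D_LATENT)
    grad_W = err.T @ S / N             # gradient of the MSE alignment loss
    grad_b = err.mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

loss = float(np.mean((map_features(S) - Z) ** 2))
```

In the actual method, `S` would come from a sketch encoder, `Z` from the LDM's image latents, and the mapped features would condition the frozen denoising process; the point of the sketch is only that the trainable part is a single small linear map.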