d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously preserving the structure of hand-drawn sketches and generating photorealistic images. The authors propose a fine-tuning-free sketch-to-image synthesis method built on pre-trained diffusion models. The core contribution is a lightweight, learnable linear mapping network that aligns sketch features with the image latent space of a frozen latent diffusion model (LDM), augmented by CLIP embeddings as spatial guidance. This design eliminates the need for model retraining or domain-specific training data. Quantitatively, the approach achieves state-of-the-art performance across multiple metrics, including FID, LPIPS, and Sketch-CLIP, and delivers superior high-resolution visual quality. It outperforms both GAN-based and fine-tuned diffusion-based methods, and the authors present it as the first zero-shot, lightweight, high-fidelity sketch-driven image generation framework.
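The core mechanism described above, a lightweight learnable linear map that projects frozen sketch-encoder features into the frozen LDM's latent space, can be sketched in miniature. Everything below is illustrative: the dimensions, the synthetic `sketch_feats` and `latent_targets` arrays, and the plain gradient-descent loop stand in for the paper's actual encoders, CLIP guidance, and training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper's actual feature and latent
# sizes are not specified in this summary.
d_sketch, d_latent, n = 64, 32, 256

# Stand-ins for sketch-encoder features and the frozen LDM's latent
# codes for matching images; in the real pipeline both would come
# from pretrained, frozen networks.
sketch_feats = rng.standard_normal((n, d_sketch))
true_map = 0.1 * rng.standard_normal((d_sketch, d_latent))
latent_targets = sketch_feats @ true_map + 0.01 * rng.standard_normal((n, d_latent))

# The lightweight linear mapping W is the only trained component;
# we fit it with plain gradient descent on a mean-squared alignment loss.
W = np.zeros((d_sketch, d_latent))
lr = 0.1
for _ in range(500):
    pred = sketch_feats @ W
    grad = sketch_feats.T @ (pred - latent_targets) / n  # gradient of the MSE loss
    W -= lr * grad

mse = np.mean((sketch_feats @ W - latent_targets) ** 2)
print(f"alignment MSE after training: {mse:.6f}")
```

Because only `W` is optimized while both encoders stay frozen, training cost is negligible next to fine-tuning a diffusion model, which is the point of the paper's fine-tuning-free design.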

📝 Abstract
Structural guidance in image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases in content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation adds complexity to the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation that exploits the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from the source to the target domain. Experimental results demonstrate that the proposed method outperforms existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.
Problem

Research questions and friction points this paper is trying to address.

Enhances sketch-to-image translation fidelity
Utilizes pretrained models without retraining
Balances shape consistency and realistic generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes pretrained latent diffusion models
Employs lightweight mapping network
Uses CLIP embeddings as spatial guidance