🤖 AI Summary
Existing text-to-image methods first generate images and then apply post-hoc color transfer, often resulting in semantic-color misalignment. This work proposes a training-free, color-conditioned diffusion generation framework that directly aligns the generated image's color distribution with that of a reference image during sampling, while preserving semantic fidelity to the text prompt. The key innovation is the first integration of a differentiable Sliced 1-Wasserstein distance into diffusion sampling, enabling distribution-level color control via joint gradient guidance and histogram matching. The method operates solely on off-the-shelf pre-trained diffusion models, requiring no fine-tuning or additional training. Extensive experiments demonstrate that the approach significantly outperforms state-of-the-art methods in both color similarity (measured against reference palettes) and text-image alignment (assessed via CLIP score and human evaluation), achieving semantically coherent and chromatically accurate controllable image synthesis.
📝 Abstract
We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at https://github.com/alobashev/sw-guidance/.
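To make the core quantity concrete, below is a minimal NumPy sketch of a Monte-Carlo Sliced 1-Wasserstein distance between two color point clouds (e.g. the pixels of a generated image and a reference palette). This is an illustration only: the function name `sliced_w1` and all parameters are our own, and the actual SW-Guidance method would evaluate this distance in a differentiable framework (e.g. PyTorch) inside the diffusion sampling loop so that its gradient can steer the denoising steps.

```python
import numpy as np

def sliced_w1(x, y, n_projections=64, seed=0):
    """Monte-Carlo estimate of the Sliced 1-Wasserstein distance
    between two equally sized point clouds x, y of shape (n, d),
    e.g. n RGB colors with d = 3."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Sample random unit directions on the (d-1)-sphere.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both clouds onto every direction: shape (n, n_projections).
    px = x @ dirs.T
    py = y @ dirs.T
    # In 1D, W1 between empirical measures of equal size is the mean
    # absolute difference of the sorted projections.
    px.sort(axis=0)
    py.sort(axis=0)
    return float(np.mean(np.abs(px - py)))
```

In a guidance setting, this scalar would be computed between the colors of the current denoised estimate and the reference distribution at each sampling step, and its gradient with respect to the latent would be added to the score, analogous to classifier guidance.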