🤖 AI Summary
Existing appearance transfer methods assume that the query-key similarity inside self-attention layers captures cross-image semantic correspondence, which often produces structural distortions and misplaced colors (e.g., a wing colored from the reference head). This work instead estimates semantic correspondences between the two images explicitly and rearranges the reference features according to those correspondences inside a pretrained text-to-image diffusion model, so the result keeps the target's structure while taking its colors from semantically matching regions of the reference. Because the correspondence is modeled directly rather than assumed, the method works even when the target and reference images are not spatially aligned. Experiments show it preserves target structure and follows semantic correspondences more faithfully than self-attention-based approaches.
📝 Abstract
As pretrained text-to-image diffusion models have become a useful tool for image synthesis, users increasingly want to control the results in various ways. In this paper, we introduce a method that produces a result with the same structure as a target image but painted with colors from a reference image, i.e., appearance transfer, while following the semantic correspondence between the result and the reference. For example, the wing in the result takes its color from the wing in the reference, not from the head. Existing methods rely on the query-key similarity within self-attention layers and often produce defective results. Instead, we propose to find semantic correspondences and explicitly rearrange the features according to them. Extensive experiments show the superiority of our method in various aspects: it preserves the structure of the target and reflects the colors of the reference according to the semantic correspondences, even when the two images are not aligned.
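The core operation described above — matching each target location to its semantically closest reference location and copying the reference feature there — can be sketched as a nearest-neighbor lookup in feature space. The snippet below is a minimal NumPy illustration under assumed `(C, H, W)` feature maps and cosine-similarity matching; the function name and the single-scale, hard-argmax matching are simplifications for exposition, not the paper's exact procedure.

```python
import numpy as np

def rearrange_features(target_feats, reference_feats):
    """Rearrange reference features into the target's spatial layout.

    For each target location, find the semantically closest reference
    location by cosine similarity of per-pixel feature vectors, then
    copy that reference feature into the target position.
    Both inputs have shape (C, H, W).
    """
    C, H, W = target_feats.shape
    t = target_feats.reshape(C, -1).T            # (H*W, C) target vectors
    r = reference_feats.reshape(C, -1).T         # (H*W, C) reference vectors
    # L2-normalize so the dot product equals cosine similarity
    t_n = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)
    r_n = r / (np.linalg.norm(r, axis=1, keepdims=True) + 1e-8)
    sim = t_n @ r_n.T                            # (HW_target, HW_ref) similarity
    match = sim.argmax(axis=1)                   # best reference index per target pixel
    # Gather the original (unnormalized) reference features
    out = reference_feats.reshape(C, -1)[:, match]
    return out.reshape(C, H, W)
```

In a diffusion pipeline this rearrangement would be applied to intermediate features during denoising, so the appearance information injected at each spatial position comes from the semantically corresponding region of the reference rather than from whatever the raw attention weights happen to select.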