🤖 AI Summary
Text-to-image diffusion models frequently exhibit color–object semantic misalignment—particularly under multi-object, multi-color prompts—where existing methods fail to achieve fine-grained color–object correspondence. To address this, we propose the first color-anchored, attention-editing technique for diffusion models: leveraging CLIP embeddings to localize color-relevant attention regions, then selectively reweighting cross-attention maps to enable object-level color semantic calibration. Our method requires no fine-tuning or additional training and is fully plug-and-play. Evaluated on a multi-color prompt benchmark, it achieves substantial improvements in color accuracy (+18.7%) and object–color alignment (+22.3%). Extensive experiments confirm strong generalization and cross-architecture effectiveness across mainstream models—including Stable Diffusion v1.5 and SDXL—demonstrating robustness without architectural modification.
📝 Abstract
Text-to-image generation has recently seen remarkable success, granting users with the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger-scale. In this work, we perform a case study on colors -- a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes-far more so than with single-color prompts-and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.