Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-guided diffusion models (TGDMs) based on Transformers frequently suffer from text–image semantic misalignment, particularly with complex prompts or multi-concept attribute-binding tasks. To address this, we propose a training-free Self-Coherence Guidance mechanism: it dynamically recalibrates cross-attention maps using attention masks derived from preceding denoising steps, thereby enhancing text–image alignment without fine-tuning. Designed specifically for Transformer-based architectures, our method avoids the failure modes that U-Net-oriented optimization strategies exhibit when transferred to Transformers. Evaluated on a newly constructed benchmark covering coarse-/fine-grained attribute binding and style binding, our approach consistently outperforms state-of-the-art methods, significantly improving both the semantic consistency and the structural fidelity of generated images. This work establishes a scalable, plug-and-play paradigm for alignment modeling in TGDMs.

📝 Abstract
We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
Problem

Research questions and friction points this paper is trying to address.

Enhancing alignment in Transformer-based Text-Guided Diffusion Models
Addressing multi-concept attribute binding challenges in image generation
Improving cross-attention maps for precise text-image alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free Transformer-based alignment enhancement
Dynamic Self-Coherence Guidance for attention maps
Optimized cross-attention without additional training
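The mechanism the summary and bullets describe can be sketched as a simple training-free recalibration step. The snippet below is a minimal illustration, not the paper's implementation: the function name, mask threshold, guidance strength, and the specific amplify/damp rule are all assumptions made for the sketch; the paper derives its masks and refinement rule from its own formulation.

```python
import torch

def self_coherence_guidance(attn_map, prev_attn_map, threshold=0.5, strength=0.8):
    """Hypothetical sketch of training-free cross-attention recalibration.

    attn_map:      cross-attention map at the current denoising step,
                   shape (heads, image_tokens, text_tokens)
    prev_attn_map: attention map from the preceding denoising step, same shape
    """
    # Derive a binary mask from the previous step: per text token, mark the
    # image regions where that token already attended strongly.
    prev_norm = prev_attn_map / (prev_attn_map.amax(dim=1, keepdim=True) + 1e-8)
    mask = (prev_norm > threshold).float()
    # Recalibrate: amplify attention inside the mask, damp it outside,
    # so each concept stays anchored to the spatial region it claimed earlier.
    guided = attn_map * (1.0 + strength * mask)
    guided = guided * (1.0 - 0.5 * strength * (1.0 - mask))
    # Renormalize over image tokens so each per-token map stays a distribution.
    return guided / (guided.sum(dim=1, keepdim=True) + 1e-8)
```

Because the mask comes from the model's own earlier denoising step, no extra training or external signal is needed, which is what makes the approach plug-and-play across checkpoints.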
Shulei Wang
Zhejiang University
Multimodal Learning · Computer Vision · Diffusion Models
Wang Lin
Zhejiang University
Computer Vision · Multi-Modal Learning · Video Understanding
Hai Huang
Zhejiang University
Hanting Wang
Zhejiang University
Image Restoration · Generative Modeling
Sihang Cai
Zhejiang University
WenKang Han
Zhejiang University
Tao Jin
Zhejiang University
Jingyuan Chen
Zhejiang University
Jiacheng Sun
Huawei Noah’s Ark Lab
Jieming Zhu
Huawei Noah’s Ark Lab
Zhou Zhao
Zhejiang University
Machine Learning · Data Mining · Multimedia Computing