AI Summary
Text-to-image diffusion models frequently suffer from object fusion or omission in multi-instance generation; existing training-free methods rely solely on semantic-level prompts and lack instance-level modeling. This paper proposes a training-free instance-to-semantic attention control mechanism. First, it constructs an instance-prioritized, tree-structured hierarchical prompt to explicitly encode object-level relationships. Second, it refines self-attention patterns and recalibrates cross-attention weights to achieve fine-grained instance disentanglement and precise semantic alignment. The method requires no external models, fine-tuning, or additional training, and it overcomes the fundamental limitation of purely semantic guidance: the inability to distinguish homogeneous instances. Experiments demonstrate substantial improvements: multi-class accuracy reaches 52% and multi-instance accuracy reaches 83%, significantly enhancing both completeness and consistency in multi-object generation.
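To make the two components concrete, here is a minimal toy sketch of (a) an instance-prioritized prompt tree whose leaves are individual object instances, and (b) a cross-attention recalibration that boosts each instance's tokens inside its assigned spatial region and suppresses them elsewhere. All names (`PromptNode`, `recalibrate`, the masks and boost factors) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class PromptNode:
    """A node in a hypothetical instance-prioritized prompt tree.

    Leaves are individual object instances; interior nodes group
    instances under shared semantic context (illustrative only).
    """
    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

    def instances(self):
        """Collect the leaf nodes, i.e. the individual instances."""
        if not self.children:
            return [self.text]
        out = []
        for child in self.children:
            out.extend(child.instances())
        return out

def recalibrate(attn, instance_masks):
    """Toy cross-attention recalibration.

    attn: (pixels, tokens) attention weights, rows sum to 1.
    instance_masks: maps a token index to a boolean pixel mask marking
    the region assigned to that instance. Weights are amplified inside
    the region, suppressed outside, then rows are re-normalized.
    (The 2.0 / 0.5 factors are arbitrary for illustration.)
    """
    out = attn.copy()
    for tok, mask in instance_masks.items():
        boost = np.where(mask, 2.0, 0.5)
        out[:, tok] *= boost
    return out / out.sum(axis=1, keepdims=True)

# "Two cats and a dog" encoded instance-first: homogeneous instances
# (the two cats) stay separate leaves instead of one semantic token.
root = PromptNode("scene", [
    PromptNode("cat #1"),
    PromptNode("cat #2"),
    PromptNode("dog #1"),
])
print(root.instances())
```

The instance-first tree is what lets the recalibration step distinguish homogeneous instances: each cat leaf gets its own token-to-region assignment, so the attention boost separates them spatially instead of fusing both cats into one blob.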
Abstract
Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.