ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

2025-05-27
Citations: 0
Influential: 0
AI Summary
Text-to-image diffusion models frequently fuse or omit objects in multi-instance generation. Existing training-free methods rely solely on semantic-level guidance and do not model individual instances. This paper proposes Instance-to-Semantic Attention Control (ISAC), a training-free attention control mechanism that takes an instance-first approach. First, it constructs an instance-prioritized, tree-structured hierarchical prompt to explicitly encode object-level relationships. Second, it refines self-attention patterns and recalibrates cross-attention weights to disentangle instances and align each one with its semantic label. The method requires no external models and no fine-tuning. It thereby addresses a fundamental limitation of semantic-only guidance: its inability to distinguish homogeneous instances. Experiments report up to 52% average multi-class accuracy and 83% average multi-instance accuracy, improving both completeness and consistency in multi-object generation.
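
To make the idea of an instance-prioritized, tree-structured prompt more concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not describe the actual prompt format, so the `PromptNode` class, the `build_prompt_tree` and `leaf_labels` helpers, and the "instance N" slot naming are all hypothetical. The sketch only illustrates the general shape of a hierarchy in which instance slots are created first and each slot then receives one semantic label.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PromptNode:
    """One node of a hypothetical tree-structured prompt (illustrative only)."""
    text: str
    children: List["PromptNode"] = field(default_factory=list)


def build_prompt_tree(scene: str, counts: Dict[str, int]) -> PromptNode:
    """Build scene -> instance slots -> semantic labels.

    Instance slots come first (one per requested object); each slot then
    gets exactly one semantic label as its child. This layout is an
    assumption made for illustration, not the format used by ISAC.
    """
    root = PromptNode(scene)
    for label, n in counts.items():
        for _ in range(n):
            slot = PromptNode(f"instance {len(root.children) + 1}")
            slot.children.append(PromptNode(label))
            root.children.append(slot)
    return root


def leaf_labels(node: PromptNode) -> List[str]:
    """Collect the semantic labels at the leaves, left to right."""
    if not node.children:
        return [node.text]
    labels: List[str] = []
    for child in node.children:
        labels.extend(leaf_labels(child))
    return labels


tree = build_prompt_tree("a photo of two cats and a dog", {"cat": 2, "dog": 1})
print(len(tree.children))  # 3 instance slots
print(leaf_labels(tree))   # ['cat', 'cat', 'dog']
```

In this layout, the two cats occupy separate instance slots even though they share a label, which is the distinction that semantic-only guidance cannot express.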

Abstract
Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.
Problem

Research questions and friction points this paper is trying to address.

Improving multi-instance generation in text-to-image diffusion models
Resolving incomplete instance formation and semantic entanglement
Disentangling multiple object instances without external models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free instance-first modeling approach
Hierarchical tree-structured prompt mechanism
Disentangling instances without external models
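
The page does not specify how ISAC adjusts attention, so the following is only a toy illustration of one generic way to align disentangled instances with semantic tokens: mask-based reweighting of cross-attention followed by renormalization. The function name `recalibrate`, the `boost` parameter, and the per-token region masks are assumptions introduced for this sketch and should not be read as the paper's procedure.

```python
from typing import List


def recalibrate(
    attn: List[List[float]],
    masks: List[List[int]],
    boost: float = 2.0,
) -> List[List[float]]:
    """Toy cross-attention reweighting (hypothetical, not ISAC's rule).

    attn[p][t]  : attention weight of pixel p on text token t (rows sum to 1).
    masks[t][p] : 1 if pixel p lies in the instance region assigned to token t.

    Weights inside a token's assigned region are multiplied by `boost`,
    then every pixel's row is renormalized to sum to 1.
    """
    out: List[List[float]] = []
    for p, row in enumerate(attn):
        scaled = [w * (boost if masks[t][p] else 1.0) for t, w in enumerate(row)]
        total = sum(scaled)
        out.append([w / total for w in scaled])
    return out


# Two pixels, two tokens; initially each pixel attends equally to both tokens.
attn = [[0.5, 0.5], [0.5, 0.5]]
# Token 0 is assigned pixel 0, token 1 is assigned pixel 1.
masks = [[1, 0], [0, 1]]

print(recalibrate(attn, masks, boost=3.0))  # [[0.75, 0.25], [0.25, 0.75]]
```

After reweighting, each pixel attends mostly to the token whose instance region contains it, while each row still forms a valid distribution.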