AI Summary
Text-to-image diffusion models frequently suffer from object fusion or omission in multi-instance generation; existing training-free methods rely solely on semantic-level prompts and lack instance-level modeling. This paper proposes a training-free instance-to-semantic attention control mechanism. First, it constructs an instance-prioritized, tree-structured hierarchical prompt to explicitly encode object-level relationships. Second, it refines self-attention patterns and recalibrates cross-attention weights to achieve fine-grained instance disentanglement and precise semantic alignment. The method requires no external models, fine-tuning, or additional training, and it overcomes the fundamental limitation of purely semantic guidance: the inability to distinguish homogeneous instances. Experiments demonstrate substantial improvements: multi-class accuracy reaches 52% and multi-instance accuracy reaches 83%, significantly enhancing both completeness and consistency in multi-object generation.
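To make the two components concrete, here is a minimal toy sketch of (a) an instance-prioritized prompt tree whose leaves are individual object instances, and (b) a cross-attention recalibration that boosts each instance's tokens inside its assigned spatial region and suppresses them elsewhere. All names (`PromptNode`, `recalibrate`, the masks and boost factors) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class PromptNode:
    """A node in a hypothetical instance-prioritized prompt tree.

    Leaves are individual object instances; interior nodes group
    instances under shared semantic context (illustrative only).
    """
    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

    def instances(self):
        """Collect the leaf nodes, i.e. the individual instances."""
        if not self.children:
            return [self.text]
        out = []
        for child in self.children:
            out.extend(child.instances())
        return out

def recalibrate(attn, instance_masks):
    """Toy cross-attention recalibration.

    attn: (pixels, tokens) attention weights, rows sum to 1.
    instance_masks: maps a token index to a boolean pixel mask marking
    the region assigned to that instance. Weights are amplified inside
    the region, suppressed outside, then rows are re-normalized.
    (The 2.0 / 0.5 factors are arbitrary for illustration.)
    """
    out = attn.copy()
    for tok, mask in instance_masks.items():
        boost = np.where(mask, 2.0, 0.5)
        out[:, tok] *= boost
    return out / out.sum(axis=1, keepdims=True)

# "Two cats and a dog" encoded instance-first: homogeneous instances
# (the two cats) stay separate leaves instead of one semantic token.
root = PromptNode("scene", [
    PromptNode("cat #1"),
    PromptNode("cat #2"),
    PromptNode("dog #1"),
])
print(root.instances())
```

The instance-first tree is what lets the recalibration step distinguish homogeneous instances: each cat leaf gets its own token-to-region assignment, so the attention boost separates them spatially instead of fusing both cats into one blob.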
Abstract
Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.