Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

📅 2025-12-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-image diffusion models struggle to simultaneously maintain subject consistency and text alignment in multi-image generation; existing approaches rely on fine-tuning or image conditioning, incurring high computational costs and poor generalization. This paper proposes a training-free geometric disentanglement method: for the first time, it leverages the geometric structure of the text embedding space, explicitly decoupling shared subject representations from scene descriptions via token-level embedding rescaling and semantic suppression—thereby mitigating cross-frame semantic leakage. The method is plug-and-play and requires only a single text prompt. Experiments demonstrate substantial improvements across multiple benchmarks: subject consistency increases by 32.7% (ID preservation rate) and text alignment accuracy improves by +0.18 in CLIP-Score, surpassing state-of-the-art methods including 1Prompt1Story.

Technology Category

Application Category

📝 Abstract
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
Problem

Research questions and friction points this paper is trying to address.

Preserving subject consistency across multiple text-to-image outputs
Eliminating semantic leakage and text misalignment in embeddings
Achieving training-free subject-consistent generation without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refines text embeddings geometrically to suppress unwanted semantics
Training-free approach enhances subject consistency and text alignment
Addresses semantic entanglement without per-subject optimization or fine-tuning
🔎 Similar Papers
No similar papers found.
S
Shangxun Li
Yonsei University
Youngjung Uh
Youngjung Uh
Yonsei University
Generative models