Structured Information for Improving Spatial Relationships in Text-to-Image Generation

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In text-to-image (T2I) generation, accurately modeling spatial relationships described in natural language prompts remains challenging. This paper proposes a lightweight, portable structured information injection method: fine-tuning a language model to automatically parse raw prompts into semantic tuples explicitly encoding spatial relations, then seamlessly integrating these tuples into mainstream T2I pipelines. The approach eliminates manual prompt engineering—automatically generated tuples achieve quality comparable to hand-crafted ones. Crucially, it enhances spatial layout accuracy without degrading overall image fidelity. Experiments demonstrate substantial improvements over baselines on spatial relationship evaluation metrics, while maintaining stable Inception Score—confirming the method’s effectiveness, robustness, and plug-and-play compatibility with existing diffusion-based or autoregressive T2I frameworks.

📝 Abstract
Text-to-image (T2I) generation has advanced rapidly, yet faithfully capturing spatial relationships described in natural language prompts remains a major challenge. Prior efforts have addressed this issue through prompt optimization, spatially grounded generation, and semantic refinement. This work introduces a lightweight approach that augments prompts with tuple-based structured information, using a fine-tuned language model for automatic conversion and seamless integration into T2I pipelines. Experimental results demonstrate substantial improvements in spatial accuracy, without compromising overall image quality as measured by Inception Score. Furthermore, the automatically generated tuples exhibit quality comparable to human-crafted tuples. This structured information provides a practical and portable solution to enhance spatial relationships in T2I generation, addressing a key limitation of current large-scale generative systems.
Problem

Research questions and friction points this paper is trying to address.

Improving spatial relationship accuracy in text-to-image generation
Addressing limitations in capturing spatial language prompts
Enhancing spatial relationships without compromising image quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments prompts with tuple-based structured information
Uses fine-tuned language model for automatic conversion
Seamless integration into text-to-image generation pipelines
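The core idea above can be illustrated with a minimal sketch. The paper uses a fine-tuned language model to parse prompts into semantic tuples; the rule-based parser below is a hypothetical stand-in for that model, and the `" | (...)"` injection format is an assumption made purely for illustration.

```python
import re

# Hypothetical sketch of tuple-based prompt augmentation. The paper fine-tunes
# a language model for the parsing step; this rule-based parser is a stand-in
# used only to illustrate the pipeline, and the tuple injection format below
# is assumed, not taken from the paper.

SPATIAL_RELATIONS = [
    "to the left of", "to the right of", "on top of",
    "above", "below", "under", "next to",
]

def parse_spatial_tuple(prompt: str):
    """Return a (subject, relation, object) tuple if a known relation is found."""
    for relation in SPATIAL_RELATIONS:
        pattern = rf"(?:an?\s+)?(\w+)\s+{re.escape(relation)}\s+(?:an?\s+)?(\w+)"
        match = re.search(pattern, prompt, flags=re.IGNORECASE)
        if match:
            return (match.group(1), relation, match.group(2))
    return None

def augment_prompt(prompt: str) -> str:
    """Append the structured tuple to the raw prompt before T2I generation."""
    triple = parse_spatial_tuple(prompt)
    if triple is None:
        return prompt  # no spatial relation detected; pass the prompt through
    subj, rel, obj = triple
    return f"{prompt} | ({subj}, {rel}, {obj})"

print(augment_prompt("a cat to the left of a dog"))
# → a cat to the left of a dog | (cat, to the left of, dog)
```

The augmented string would then be fed to the T2I model in place of the raw prompt, which is what makes the approach plug-and-play: the generation pipeline itself is unchanged.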
Sander Schildermans
KU Leuven, Leuven, Belgium
Chang Tian
KU Leuven, Leuven, Belgium
Ying Jiao
KU Leuven, Leuven, Belgium
Marie-Francine Moens
Professor of Computer Science KU Leuven
Natural language processing and understanding, machine learning, information retrieval, multimedia