CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Text-to-image (T2I) diffusion models generate high-fidelity images but often fail to render spatial relationships among objects accurately, primarily because spatial annotations in training data are ambiguous and standard text encoders do not reliably interpret spatial semantics. To address this, we propose SCOP, a data engine that constructs high-quality, spatially grounded image-text pairs using a set of spatial constraints. We further introduce TENOR, a plug-and-play module that makes token ordering information explicit to compensate for the text encoder; it is compatible with both UNet and MMDiT backbones and requires no changes to the base diffusion architecture. Evaluated on three spatial reasoning benchmarks (VISOR, T2I-CompBench Spatial, and GenEval Position), our method achieves relative improvements of 98%, 67%, and 131%, respectively, setting a new state of the art across four open-weight T2I models spanning UNet- and MMDiT-based architectures.

📝 Abstract
Text-to-image diffusion models excel at generating photorealistic images, but commonly struggle to render accurate spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances the spatial understanding of any T2I diffusion model. CoMPaSS resolves the ambiguity of spatial data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially accurate training data through a set of principled spatial constraints. CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module, which lets the model better exploit the curated high-quality spatial priors and effectively compensates for the shortcomings of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet- and MMDiT-based architectures demonstrate the effectiveness of CoMPaSS, which sets a new state of the art with substantial relative gains across well-known benchmarks for spatial relationship generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code will be available at https://github.com/blurgyy/CoMPaSS.
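
The gains quoted in the abstract are relative, not absolute percentage points. As a minimal sketch of how such a figure is typically computed, assuming the usual definition (improved minus baseline, divided by baseline), the helper below uses the hypothetical name `relative_gain` and made-up scores that are not taken from the paper:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (1.0 means +100%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical benchmark scores, chosen only to illustrate the arithmetic.
baseline_score = 0.25
improved_score = 0.50

gain = relative_gain(baseline_score, improved_score)
print(f"{gain:+.0%}")  # +100%
```

Under this definition, a relative gain above 100% (such as the +131% reported for GenEval Position) means the score more than doubled; it says nothing by itself about the absolute score reached.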

Problem

Research questions and friction points this paper is trying to address.

Addressing ambiguous spatial data in text-to-image datasets
Improving text encoders' spatial semantics interpretation
Enhancing spatial relationship accuracy in diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

SCOP data engine curates spatially-accurate training data
TENOR module preserves token ordering information
Framework enhances spatial understanding in diffusion models
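
To illustrate why token ordering matters for spatial prompts, the toy sketch below compares "cat left of dog" with "dog left of cat". It is not the TENOR implementation, whose mechanism is not described here; it assumes hypothetical one-hot embeddings (`TOY_EMBEDDINGS`) and a generic sinusoidal positional signal, and only shows that a consumer treating tokens as an unordered set cannot tell the two prompts apart unless order is encoded in the token vectors.

```python
import math

# Hypothetical 4-dimensional one-hot embeddings, for illustration only.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0, 0.0],
    "left": [0.0, 1.0, 0.0, 0.0],
    "of": [0.0, 0.0, 1.0, 0.0],
    "dog": [0.0, 0.0, 0.0, 1.0],
}


def sinusoidal_position(pos, dim):
    """Generic sinusoidal positional vector for one token position."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def embed(prompt, with_order):
    """Embed each word; optionally add a positional signal to each token."""
    vectors = []
    for pos, word in enumerate(prompt.split()):
        vec = list(TOY_EMBEDDINGS[word])
        if with_order:
            pe = sinusoidal_position(pos, len(vec))
            vec = [v + p for v, p in zip(vec, pe)]
        vectors.append(vec)
    return vectors


def as_unordered_set(vectors):
    """Canonical form that discards the sequence order of the tokens."""
    return sorted(tuple(v) for v in vectors)


prompt_a = "cat left of dog"
prompt_b = "dog left of cat"

same_without_order = (
    as_unordered_set(embed(prompt_a, False)) == as_unordered_set(embed(prompt_b, False))
)
same_with_order = (
    as_unordered_set(embed(prompt_a, True)) == as_unordered_set(embed(prompt_b, True))
)
print(same_without_order, same_with_order)  # True False
```

Without the positional signal the two prompts yield the same set of token vectors even though they describe opposite layouts; with it, the vectors for "cat" and "dog" differ by position and the prompts become distinguishable.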
Gaoyang Zhang
State Key Lab of CAD&CG, Zhejiang University
Bingtao Fu
Ant Group
Computer Vision
Qingnan Fan
Lead researcher @ VIVO | Prev Tencent, Stanford, SDU
Diffusion models, 3D Vision, Computer Graphics
Qi Zhang
vivo Mobile Communication Co. Ltd
Runxing Liu
vivo Mobile Communication Co. Ltd
Hong Gu
National Institute on Drug Abuse, NIH
functional MRI, functional connectivity, drug addiction
Huaqi Zhang
vivo Mobile Communication Co. Ltd
Xinguo Liu
State Key Lab of CAD&CG, Zhejiang University