🤖 AI Summary
Text-to-image (T2I) models still struggle to accurately model spatial relations—e.g., “above” or “to the right”—and existing positional control methods rely on external modules that are incompatible with the diffusion Transformer architectures now dominant. This paper introduces a training-free, general-purpose positional control method: during generation, it automatically identifies the attention heads that isolate each object, generates semantically aligned bounding boxes, and stitches per-object features together in latent space to enforce precise spatial layouts. To our knowledge, this is the first method enabling plug-and-play positional control in multimodal diffusion Transformers—including Qwen-Image, FLUX, and SD3.5—without architectural modification or fine-tuning. Evaluated on our newly constructed PosEval benchmark, FLUX achieves a 206% improvement over its baseline, and Qwen-Image attains state-of-the-art performance, surpassing prior approaches by 54%.
📝 Abstract
Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
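The core compositing step described above—generating each object within its bounding box and stitching the results together mid-generation—can be sketched in a few lines. The snippet below is an illustrative simplification, not the paper's implementation: `stitch_latents`, its arguments, and the hard box-masking scheme are all hypothetical stand-ins (the actual method isolates objects via targeted attention heads rather than raw box copies).

```python
import numpy as np

def stitch_latents(background, objects, boxes):
    """Compose per-object latent patches onto a base latent canvas.

    background : (H, W, C) array, the base latent.
    objects    : list of (H, W, C) arrays, one partial generation per object.
    boxes      : list of (top, left, bottom, right) bounding boxes, one per
                 object, in latent-grid coordinates.

    Hypothetical sketch: a hard cut-and-paste per box, standing in for the
    attention-head-based object isolation described in the abstract.
    """
    canvas = background.copy()
    for obj, (top, left, bottom, right) in zip(objects, boxes):
        # Copy only the object's designated region into the shared canvas.
        canvas[top:bottom, left:right, :] = obj[top:bottom, left:right, :]
    return canvas
```

In this toy form, each object occupies exactly its box; the paper's contribution is doing the analogous composition inside an MMDiT's feature space, mid-generation and training-free, so that the final image remains globally coherent rather than a visible collage.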