Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

📅 2025-09-22
🤖 AI Summary
This work investigates how cross-modal attention in text-to-image diffusion models aligns text tokens with image spatial regions, and how that alignment enables open-vocabulary semantic segmentation. To this end, we propose Seg4Diff, a framework that systematically analyzes how attention propagates semantic information in multimodal diffusion Transformers (MM-DiT), which apply joint self-attention over concatenated image and text tokens. We empirically discover, for the first time, that high-quality, spatially coherent segmentation masks naturally emerge at specific intermediate layers, revealing semantic grouping as an intrinsic emergent property of diffusion Transformers. With only lightweight fine-tuning on mask-annotated image data, we improve both segmentation accuracy and generation fidelity. Crucially, Seg4Diff requires no auxiliary segmentation head or pre-trained segmentation model. Our approach points toward a unified paradigm bridging visual perception and generative modeling, demonstrating that segmentation capability is inherently encoded in diffusion-based vision-language representations.

📝 Abstract
Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.
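
To make the idea of reading segmentation masks out of joint attention concrete, here is a minimal sketch, not the authors' implementation. It assumes identity query/key projections, a single head, image tokens placed before text tokens in the concatenated sequence, and toy 2-D embeddings; all function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)) over the concatenated sequence."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(qv, kv)) for kv in keys])
        for qv in queries
    ]

def image_to_text_block(attn, num_image_tokens):
    """Keep rows of image queries and columns of text keys.

    Assumes image tokens come first in the concatenated sequence.
    """
    return [row[num_image_tokens:] for row in attn[:num_image_tokens]]

def segment(i2t, height, width):
    """Assign each image token the text token it attends to most."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in i2t]
    return [labels[r * width:(r + 1) * width] for r in range(height)]

# Toy 2x2 image: the top row resembles text token 0, the bottom row text token 1.
image_tokens = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
text_tokens = [[4.0, 0.0], [0.0, 4.0]]
tokens = image_tokens + text_tokens  # concatenated sequence (assumed order)

attn = joint_attention(tokens, tokens)
i2t = image_to_text_block(attn, num_image_tokens=len(image_tokens))
mask = segment(i2t, height=2, width=2)
print(mask)
```

In a real MM-DiT block the queries and keys would come from learned projections and multiple heads, so the sub-block would typically be averaged over heads before the argmax.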

Problem

Research questions and friction points this paper is trying to address.

Understanding how attention maps contribute to image generation in diffusion transformers
Identifying specific layers that align text tokens with coherent image regions
Enhancing semantic grouping capabilities to improve segmentation and image fidelity
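
One plausible way to locate a layer whose attention aligns text tokens with coherent regions is to score each layer's predicted label map against a reference mask. The sketch below uses mean IoU for that purpose; this criterion, the layer indices, and the toy label maps are illustrative assumptions rather than the procedure or results reported in the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def rank_layers(per_layer_preds, target, num_classes):
    """Return the best-scoring layer and the per-layer mIoU scores."""
    scores = {
        layer: mean_iou(pred, target, num_classes)
        for layer, pred in per_layer_preds.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical flattened label maps for three layers (values are made up).
target = [0, 0, 1, 1]
per_layer_preds = {
    3: [0, 1, 0, 1],
    9: [0, 0, 1, 1],
    15: [0, 0, 0, 1],
}

best_layer, layer_scores = rank_layers(per_layer_preds, target, num_classes=2)
print(best_layer, layer_scores)
```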

Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies semantic grounding expert layer in MM-DiT
Applies lightweight fine-tuning with mask-annotated data
Amplifies emergent semantic grouping for improved performance
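
As an illustration of what mask supervision on attention could look like, the sketch below treats each image token's image-to-text scores as logits over text tokens and applies a cross-entropy loss against the annotated label. The summary does not specify the loss actually used, so this objective and the name `mask_alignment_loss` are assumptions.

```python
import math

def mask_alignment_loss(i2t_scores, mask_labels):
    """Mean cross-entropy between per-token text logits and mask labels.

    i2t_scores: one list of raw image-to-text scores per image token.
    mask_labels: index of the annotated text token for each image token.
    """
    total = 0.0
    for scores, label in zip(i2t_scores, mask_labels):
        m = max(scores)
        # log-sum-exp, shifted by the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
    return total / len(mask_labels)

# Toy scores: token 0 favors text token 0, token 1 favors text token 1.
scores = [[2.0, 0.0], [0.0, 2.0]]
aligned = mask_alignment_loss(scores, [0, 1])
misaligned = mask_alignment_loss(scores, [1, 0])
print(aligned, misaligned)
```

The loss is lower when the attention already agrees with the mask, so minimizing it pushes the supervised layer toward sharper text-to-region alignment.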