DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

📅 2023-05-24
🏛️ arXiv.org
📈 Citations: 16
Influential: 3
🤖 AI Summary
To address the limited scalability of diffusion models for multimodal conditional generation, this paper proposes a lightweight adapter framework that leaves the Stable Diffusion backbone parameters unchanged. The core contribution is a three-way conditional channel separation architecture, comprising image-form, spatial-token, and non-spatial-token pathways, which enables plug-and-play integration of diverse modalities including text, sketches, bounding boxes, color palettes, and style embeddings. A modular conditional fusion mechanism and a decoupled encoding design support fine-grained cross-modal alignment and controllable generation. On multimodal image synthesis tasks, the method outperforms existing approaches in both quantitative metrics and qualitative assessments, achieving high-fidelity, composable, and scalable text-to-image generation from multiple conditioning signals without retraining the backbone model.
📝 Abstract
In this study, we aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description, such as sketch, box, color palette, and style embedding, within a single model. We thus design a multimodal T2I diffusion model, coined DiffBlender, by separating the channels of conditions into three types, i.e., image forms, spatial tokens, and non-spatial tokens. The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation. Notably, we achieve this without altering the parameters of the existing generative model, Stable Diffusion, updating only partial components. Our study establishes new benchmarks in multimodal generation through quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender faithfully blends all the provided information and showcase its various applications in detailed image synthesis.
Problem

Research questions and friction points this paper is trying to address.

Enhancing text-to-image generation with multiple modalities
Integrating structure, layout, and attribute inputs in diffusion models
Enabling multimodal conditioning without full model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal diffusion framework without full retraining
Categorizes inputs into structure, layout, attribute modalities
Updates only small subset of pretrained model components
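The channel-separation idea above can be sketched as a simple routing step that groups each incoming condition into one of the three pathways (image-form, spatial token, non-spatial token) before encoding, while the backbone stays frozen. This is a minimal illustrative sketch, not the authors' implementation; all names and the modality-to-pathway mapping are assumptions based on the summary and abstract.

```python
# Hypothetical sketch of DiffBlender-style condition routing.
# Each modality is assigned to one of the three conditioning pathways
# described in the paper; the mapping below is illustrative only.
PATHWAY = {
    "sketch": "image_form",                 # dense, pixel-aligned structure
    "box": "spatial_token",                 # localized layout (bounding boxes)
    "color_palette": "non_spatial_token",   # global attribute
    "style_embedding": "non_spatial_token", # global attribute
}

def route_conditions(conditions):
    """Group raw condition inputs by pathway so each group can be fed to
    its own lightweight encoder while the pretrained backbone stays frozen."""
    routed = {"image_form": {}, "spatial_token": {}, "non_spatial_token": {}}
    for name, value in conditions.items():
        pathway = PATHWAY.get(name)
        if pathway is None:
            raise ValueError(f"unknown modality: {name}")
        routed[pathway][name] = value
    return routed
```

Under this routing, adding a new modality only means registering it in the pathway table and training its small encoder, which matches the plug-and-play, no-backbone-retraining claim.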