🤖 AI Summary
Current text-to-image diffusion models require separate task-specific architectures and training pipelines for each task—generation, inpainting, instruction-based editing, layout-guided synthesis, depth estimation, and referring segmentation—leading to redundancy and inefficiency.
Method: We propose UniVG, a generalist diffusion model that supports all these tasks jointly with a single architecture and a single set of shared parameters. Its core innovations include: (i) unified multimodal conditional encoding across diverse input modalities; (ii) auxiliary visual tasks (depth estimation and referring segmentation) that enhance the primary generation and editing capabilities via knowledge transfer; and (iii) multi-task mixed training with cross-task data balancing and joint loss optimization.
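The multi-task mixed training described in (iii) can be sketched as weighted task sampling plus a scalarized joint objective. This is an illustrative outline only — the task names, mixing weights, and loss scales below are hypothetical placeholders, not values from the paper:

```python
import random

# Hypothetical cross-task sampling weights for data balancing
# (illustrative values, not taken from the paper).
TASK_WEIGHTS = {
    "t2i": 0.4,
    "inpainting": 0.2,
    "editing": 0.2,
    "depth": 0.1,
    "segmentation": 0.1,
}

def sample_task(rng: random.Random) -> str:
    """Draw the task for the next training batch according to the mix."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

def joint_loss(per_task_losses: dict, loss_scales: dict) -> float:
    """Combine per-task diffusion losses into one scalar objective."""
    return sum(loss_scales.get(task, 1.0) * loss
               for task, loss in per_task_losses.items())
```

In this sketch every task shares the same parameters; balancing happens only through the sampling weights and per-task loss scales, matching the idea of a single model trained on a mixed task stream.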
Results: UniVG matches or exceeds specialized models on key tasks—including text-to-image synthesis, inpainting, and instruction-based editing—while significantly improving depth estimation and referring segmentation accuracy, validating the effectiveness and generalization benefits of its unified representation.
📝 Abstract
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.