🤖 AI Summary
Current text-to-image diffusion models require separate task-specific architectures and training pipelines for each task—generation, inpainting, instruction-based editing, layout-guided synthesis, depth estimation, and referring segmentation—leading to redundancy and inefficiency.
Method: We propose UniVG, a generalist diffusion model that supports all these tasks jointly with a single architecture and a single set of shared parameters. Its core innovations include: (i) unified multimodal conditional encoding across diverse input modalities; (ii) auxiliary visual tasks (depth estimation and referring segmentation) that enhance the primary generation and editing capabilities via knowledge transfer; and (iii) multi-task mixed training with cross-task data balancing and joint loss optimization.
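The multi-task mixed training described in (iii) can be sketched as weighted task sampling plus a scalarized joint objective. This is an illustrative outline only — the task names, mixing weights, and loss scales below are hypothetical placeholders, not values from the paper:

```python
import random

# Hypothetical cross-task sampling weights for data balancing
# (illustrative values, not taken from the paper).
TASK_WEIGHTS = {
    "t2i": 0.4,
    "inpainting": 0.2,
    "editing": 0.2,
    "depth": 0.1,
    "segmentation": 0.1,
}

def sample_task(rng: random.Random) -> str:
    """Draw the task for the next training batch according to the mix."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

def joint_loss(per_task_losses: dict, loss_scales: dict) -> float:
    """Combine per-task diffusion losses into one scalar objective."""
    return sum(loss_scales.get(task, 1.0) * loss
               for task, loss in per_task_losses.items())
```

In this sketch every task shares the same parameters; balancing happens only through the sampling weights and per-task loss scales, matching the idea of a single model trained on a mixed task stream.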
Results: UniVG matches or exceeds specialized models on key tasks—including text-to-image synthesis, inpainting, and instruction-based editing—while significantly improving depth estimation and referring segmentation accuracy, validating the effectiveness and generalization benefits of its unified representation.
📝 Abstract
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.