MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches treat subject-driven generation and instruction-based editing as separate tasks, suffering from scarce high-quality annotations, poor generalization, and difficulty preserving visual consistency. This paper proposes MIGE, a unified multimodal instruction-based framework that models subject-driven generation as creation on a blank canvas and image editing as constrained modification of an existing image, both conditioned on joint text-image instructions and trained end-to-end. Key contributions: (1) a unified multimodal instruction representation; (2) a feature fusion mechanism that enables cross-task knowledge transfer; and (3) support for instruction-based subject-driven editing, a newly introduced compositional task. The method achieves state-of-the-art performance on both subject-driven generation and instruction-based editing benchmarks, and significantly outperforms prior work on the new task. Code and models are publicly available.

📝 Abstract
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Therefore, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. MIGE introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the novel task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
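To make the shared input-output formulation concrete, here is a minimal PyTorch sketch of how such a design could look. It is an illustrative assumption, not the released implementation: `MultimodalInstructionEncoder`, `unified_inputs`, the dimensions, and the transformer fusion layer are hypothetical stand-ins for MIGE's multimodal encoder and feature fusion mechanism.

```python
import torch
import torch.nn as nn

class MultimodalInstructionEncoder(nn.Module):
    """Toy stand-in for MIGE's multimodal encoder: projects text tokens and
    reference-image features into one space, then fuses them jointly."""

    def __init__(self, text_dim=768, vis_dim=1024, fused_dim=768, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.vis_proj = nn.Linear(vis_dim, fused_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=fused_dim, nhead=heads, batch_first=True
        )
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens, image_feats):
        # Concatenate both modalities into one token sequence, then fuse.
        seq = torch.cat(
            [self.text_proj(text_tokens), self.vis_proj(image_feats)], dim=1
        )
        return self.fuse(seq)

def unified_inputs(encoder, text_tokens, ref_feats, source_image=None):
    """Shared formulation: subject-driven generation starts from a blank
    canvas; instruction-based editing starts from the given source image.
    Both yield (source, condition) for a diffusion backbone."""
    if source_image is None:
        source_image = torch.zeros(1, 3, 512, 512)  # "blank canvas" case
    condition = encoder(text_tokens, ref_feats)
    return source_image, condition

# Usage: the same call covers both tasks.
enc = MultimodalInstructionEncoder()
text = torch.randn(1, 16, 768)   # placeholder text-token embeddings
ref = torch.randn(1, 32, 1024)   # placeholder reference-image features
canvas, cond = unified_inputs(enc, text, ref)                             # generation
src, cond = unified_inputs(enc, text, ref, torch.randn(1, 3, 512, 512))   # editing
```

The point of the sketch is that generation and editing differ only in the source image; the instruction encoding and the conditioning path are identical, which is what makes joint training possible.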
Problem

Research questions and friction points this paper is trying to address.

Subject-driven image generation and instruction-based editing remain challenging and are typically treated as separate tasks.
Existing methods struggle with limited high-quality data and poor generalization.
Both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for multimodal instruction-based tasks
Novel multimodal encoder integrates vision-language features
Joint training enhances cross-task generalization and consistency
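Because both tasks share one (source, instruction, target) interface, joint training reduces to sampling batches from either task and applying a single loss. The sketch below is a hedged illustration of that idea: `diffusion_loss` is a hypothetical stand-in for whatever denoising objective the backbone actually uses, and the fifty-fifty task sampling is an assumption, not the paper's schedule.

```python
import random

def joint_training_step(model, optimizer, gen_batch, edit_batch):
    """One mixed-task update. Each batch is (source, condition, target);
    generation batches carry a blank-canvas source, editing batches a
    real source image."""
    task, (src, cond, target) = random.choice(
        [("generation", gen_batch), ("editing", edit_batch)]
    )
    loss = model.diffusion_loss(src, cond, target)  # hypothetical loss API
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task, loss.item()
```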
👥 Authors

Xueyun Tian
Institute of Computing Technology
Multimodal Generation, MLLM

Wei Li
CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Bingbing Xu
Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences
Graph Neural Networks, Network Embedding

Yige Yuan
Ph.D. student, Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning, Reinforcement Learning

Yuanzhuo Wang
CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Huawei Shen
CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China