EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

📅 2024-09-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing zero-shot personalized image generation methods struggle to balance fine-grained subject fidelity with precise text–image alignment, resulting in inaccurate subject encoding and inconsistent generation quality. To address this, we propose a fine-tuning-free zero-shot framework that repurposes the frozen Diffusion UNet itself as the subject encoder. We introduce a text–image guidance decoupling mechanism that uses multi-stage denoising scheduling and critical-timestep revisiting to enhance subject transfer fidelity. Our method is architecture-agnostic, compatible with both SD2.1-base and SDXL, and achieves state-of-the-art performance across multiple benchmarks using only 1% of typical training data. When transferred to SDXL, it significantly improves subject consistency and generation fidelity without architectural modification, demonstrating strong generalizability and efficiency.

📝 Abstract
Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and a subject image, requiring the model to incorporate both sources of guidance. Existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and imbalanced generation. In this study, we uncover key insights into overcoming such drawbacks, notably that 1) the choice of the subject image encoder critically influences subject identity preservation and training efficiency, and 2) the text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, that employs two main components: a fixed pre-trained Diffusion UNet used as the subject encoder, and a guidance process that balances the two guidances by separating their dominance stages and revisiting certain timesteps to bootstrap subject transfer quality. Through these two components, EZIGen, initially built upon SD2.1-base, achieves state-of-the-art performance on multiple personalized generation benchmarks with a unified model, while using 100 times less training data. Moreover, by further migrating our design to SDXL, EZIGen is proven to be a versatile model-agnostic solution for personalized generation. Demo Page: zichengduan.github.io/pages/EZIGen/index.html
Problem

Research questions and friction points this paper is trying to address.

Improving zero-shot personalized image generation accuracy
Balancing text and subject guidance in image generation
Enhancing subject detail preservation with efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses fixed pre-trained Diffusion UNet as encoder
Separates text and subject guidance stages
Revisits time steps to boost subject quality
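The decoupled-guidance idea above can be sketched as a denoising schedule. The sketch below is an illustrative reading, not the paper's implementation: the stage split (text guidance dominating early, high-noise steps for layout; subject guidance dominating later steps for identity detail), the step count, and the revisited timesteps are all assumed values for demonstration.

```python
def guidance_for_step(t, total_steps=50, subject_stage_frac=0.6):
    """Return which guidance dominates at denoising step t
    (t counts down from total_steps - 1 to 0).

    Assumption: text guidance shapes global layout during early
    high-noise steps, then subject guidance takes over to transfer
    fine-grained identity; the actual split point in the paper
    may differ.
    """
    if t >= total_steps * (1 - subject_stage_frac):
        return "text"
    return "subject"


def build_schedule(total_steps=50, revisit_steps=(10, 5), revisit_times=2):
    """Build a denoising schedule that revisits critical timesteps.

    Revisited steps are repeated so subject guidance can be applied
    again there, bootstrapping subject transfer quality. The choice
    of which steps to revisit is a hypothetical placeholder.
    """
    schedule = []
    for t in reversed(range(total_steps)):
        schedule.append(t)
        if t in revisit_steps:
            # Re-run this timestep to reinforce subject details.
            schedule.extend([t] * (revisit_times - 1))
    return schedule
```

In a real pipeline, each entry of the schedule would drive one UNet denoising call, with `guidance_for_step` selecting which conditioning (text embedding vs. subject features from the frozen UNet encoder) dominates that call.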