🤖 AI Summary
Personalized text-to-image generation suffers from semantic entanglement between identity and background, which impairs feature localization accuracy, identity fidelity, and output diversity. To address this, the paper proposes PIDiff, a fine-tuning framework for identity customization that combines the StyleGAN W+ space with diffusion models (DDPMs). It introduces two key components: (1) a W+-space fine-tuning strategy, described as the first to disentangle identity from background, which avoids interfering with the global parameters; and (2) a fine-grained cross-attention module that fuses CLIP-guided text embeddings with multi-scale identity features to enable precise semantic localization and editing. Evaluated on multiple benchmarks, the method significantly improves identity consistency and image diversity without compromising the DDPM’s generative capability. It achieves state-of-the-art results on quantitative metrics, including FaceID and CLIP-Score, and demonstrates robust identity customization and controllable style editing.
📝 Abstract
Personalized-identity text-to-image generation aims to incorporate a specific identity into generated images, given a text prompt and an identity image. Building on the strong generative capabilities of DDPMs, many previous works use additional prompts, such as text embeddings and CLIP image embeddings, to represent identity information, but they fail to disentangle identity information from background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, other works have combined the W+ space of StyleGAN with diffusion models, using this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, identity and background information remain entangled in the in-the-wild images used for training, which prevents accurate identity localization and causes severe semantic interference between identity and background. In this paper, we propose PIDiff, a novel fine-tuning-based diffusion model for personalized-identity text-to-image generation. PIDiff leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and to achieve accurate feature extraction and localization. PIDiff also supports style editing by preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. By combining the proposed cross-attention block with the parameter optimization strategy, PIDiff preserves identity information while retaining the pre-trained model's ability to generate in-the-wild images during inference. Our experimental results validate the effectiveness of our method on this task.
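To make the idea of conditioning on both text and W+ identity features more concrete, the following is a minimal NumPy sketch of single-head cross-attention in which image-latent queries attend to text tokens and identity tokens. It is not the PIDiff cross-attention block, whose design the abstract does not specify. The token counts, the model width, the linear projection `w_proj`, and the choice to concatenate text and identity tokens into one context are all assumptions made for illustration; the only borrowed convention is that a W+ code for a 1024×1024 StyleGAN consists of 18 vectors of dimension 512.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends to all context rows."""
    q = queries @ w_q                          # (n_queries, d)
    k = context @ w_k                          # (n_context, d)
    v = context @ w_v                          # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return attn @ v, attn


rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
d_model = 64                 # width of latent and text tokens
n_pixels, n_text = 16, 8     # flattened latent positions, text tokens
n_wplus, d_w = 18, 512       # W+ code: 18 layers x 512 dims (StyleGAN at 1024x1024)

latent = rng.standard_normal((n_pixels, d_model))    # stand-in for diffusion latents
text_tokens = rng.standard_normal((n_text, d_model)) # stand-in for text embeddings
w_plus = rng.standard_normal((n_wplus, d_w))         # stand-in for an inverted identity code

# Hypothetical projection: one identity token per W+ layer (coarse to fine).
w_proj = rng.standard_normal((d_w, d_model)) / np.sqrt(d_w)
id_tokens = w_plus @ w_proj                          # (18, d_model)

# Assumed fusion scheme: concatenate text and identity tokens as one context.
context = np.concatenate([text_tokens, id_tokens], axis=0)   # (26, d_model)

w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)
out, attn = cross_attention(latent, context, w_q, w_k, w_v)
```

In this sketch, the columns `attn[:, n_text:]` give the attention each latent position pays to the 18 identity tokens, which is one simple way to inspect where identity features are being localized. Because the W+ layers are ordered from coarse to fine, replacing only a subset of rows in `w_plus` before projection would be one hypothetical way to experiment with the kind of layer-wise style editing the abstract mentions.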