IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation

📅 2025-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper proposes a two-stage LoRA fine-tuning framework that targets low subject fidelity, catastrophic forgetting, and overfitting in subject-driven personalization of text-to-image diffusion models. In Stage I, the unmodified SDXL generates a generic scene from a prompt in which the subject is replaced by its class label. In Stage II, a segmentation-driven Img2Img pipeline that uses the trained LoRA weights inserts the personalized subject into that scene. Separating subject encoding from scene composition preserves SDXL's broader generative capabilities while keeping personalization lightweight: only LoRA weights on the U-Net attention layers are trained, not the full model. The method achieves a DINO similarity score of 0.789 on SDXL, which the authors report as outperforming existing personalized text-to-image approaches.
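
For readers unfamiliar with the metric, a DINO similarity score is commonly computed as the mean cosine similarity between DINO image embeddings of generated images and of reference images. The sketch below illustrates that arithmetic only. It assumes the embeddings have already been extracted, uses toy 3-dimensional vectors in their place, and the function names are hypothetical; the paper's exact evaluation protocol is not stated on this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(generated, references):
    """Average cosine similarity over every (generated, reference) pair."""
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(scores) / len(scores)


# Assumption: toy vectors standing in for real DINO image embeddings.
reference_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
generated_embeddings = [[1.0, 0.0, 0.0]]

score = mean_pairwise_similarity(generated_embeddings, reference_embeddings)
```

A score closer to 1 means the generated images sit closer to the references in embedding space, which is why the metric is used as a proxy for subject fidelity.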

📝 Abstract

Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging: existing approaches often suffer from catastrophic forgetting, overfitting, or large computational overhead. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning on the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL's broader generative capabilities while integrating the new subject in a high-fidelity manner. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
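
The control flow of the two stages can be illustrated with a toy sketch. It is not the authors' implementation: the "sks dog" subject token, the function names, and the tiny integer grids are all hypothetical, and the masked pixel copy in `insert_subject` is only a stand-in for the real Stage II, which would run a diffusion Img2Img pass with the trained LoRA weights over the segmented region.

```python
def make_generic_prompt(prompt, subject_token, class_label):
    """Stage I input: swap the personalized subject token for its class label."""
    return prompt.replace(subject_token, class_label)


def insert_subject(scene, subject_render, mask):
    """Stage II stand-in: change only the pixels the segmentation mask selects."""
    return [
        [s if m else b for b, s, m in zip(scene_row, subject_row, mask_row)]
        for scene_row, subject_row, mask_row in zip(scene, subject_render, mask)
    ]


# Assumption: "sks dog" is a hypothetical token naming the personalized subject.
generic_prompt = make_generic_prompt(
    "a photo of a sks dog on a beach", "sks dog", "dog"
)

# Assumption: tiny integer grids stand in for images and a segmentation mask.
scene = [[0, 0, 0], [0, 0, 0]]            # generic scene from the unmodified model
subject_render = [[9, 9, 9], [9, 9, 9]]   # content produced with the LoRA weights
mask = [[0, 1, 0], [0, 1, 1]]             # 1 marks where the class object was found

result = insert_subject(scene, subject_render, mask)
```

Because the mask limits where the second stage may write, everything outside the subject region comes straight from the unmodified model, which is the sense in which composition is kept separate from subject encoding.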

Problem

Research questions and friction points this paper is trying to address.

Enhancing subject fidelity in personalized text-to-image generation

Overcoming catastrophic forgetting and overfitting in diffusion models

Reducing computational overhead in personalized image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-based fine-tuning for attention weights

Segmentation-driven Img2Img pipeline integration

Two-stage subject isolation and composition
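
As background for the first item, LoRA leaves a pretrained weight matrix W frozen and trains a low-rank correction, so the effective weight is W + (alpha / r) * B A, where r is the rank. The sketch below shows that update in plain Python on a 2x2 matrix. The matrix values, the rank of 1, and the function names are illustrative assumptions, not settings reported for this paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    rank = len(lora_a)                 # A has shape r x k
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)     # d x k low-rank update
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Assumption: a toy 2x2 attention projection with a rank-1 adapter.
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]                  # r x k
lora_b_init = [[0.0], [0.0]]           # d x r, zero at initialization
lora_b_trained = [[0.5], [-0.5]]       # hypothetical values after training

at_init = lora_effective_weight(frozen_w, lora_b_init, lora_a, alpha=1.0)
after_training = lora_effective_weight(frozen_w, lora_b_trained, lora_a, alpha=1.0)
```

With B initialized to zero the adapted layer starts out identical to the pretrained one, and removing the adapter recovers the original model exactly, which is what lets Stage I run on an unmodified SDXL.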
