FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement

📅 2025-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image generation struggles with overfitting and attribute leakage when modeling multiple similar concepts (e.g., several visually alike dogs). Method: The paper proposes Fuse-and-Refine (FaR), a disentangled generation framework tailored to intra-class similar subjects. It introduces (i) Concept Fusion, a data augmentation that separates reference subjects from their backgrounds and recomposes them into composite images to improve few-shot generalization; (ii) a Localized Refinement loss that leverages the diffusion model's fine-grained cross-attention maps to align each concept with its own semantic region and decouple cross-concept representations; and (iii) simultaneous fine-tuning of selected modules to balance learning new concepts against retaining prior knowledge. Results: Experiments show that the approach significantly mitigates attribute leakage and overfitting while preserving photorealistic lighting and shading, and it achieves state-of-the-art multi-concept generation quality, outperforming existing methods both quantitatively and qualitatively.
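
The loss itself is only described at a high level here. Below is a minimal PyTorch-style sketch of one plausible formulation, assuming per-concept cross-attention maps and ground-truth region masks are available; localized_refinement_loss and its arguments are illustrative names, not the paper's actual API.

```python
import torch

def localized_refinement_loss(attn_maps: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Penalize attention mass a concept token places outside its own region.

    attn_maps: (C, H, W) cross-attention maps, one per concept token.
    masks:     (C, H, W) binary masks marking each concept's target region.
    """
    loss = attn_maps.new_zeros(())
    for attn, mask in zip(attn_maps, masks):
        attn = attn / (attn.sum() + 1e-8)          # normalize to a spatial distribution
        loss = loss + (attn * (1.0 - mask)).sum()  # attention leaking outside the region
    return loss / attn_maps.shape[0]
```

During fine-tuning, a term like this would be added to the usual denoising objective so that similar subjects keep disjoint attention footprints.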

📝 Abstract
Generating multiple new concepts remains a challenging problem in text-to-image generation. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: a Concept Fusion technique and a Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from their backgrounds and recombining them into composite images to increase diversity. This augmentation mitigates overfitting caused by the narrow distribution of the limited training samples. In addition, the Localized Refinement loss preserves each subject's representative attributes by aligning each concept's attention map to its correct region, which prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules simultaneously, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
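
To make the separation-and-recomposition step concrete, here is a minimal Pillow sketch, under the assumption that reference subjects have already been segmented into RGBA cut-outs; the compositing and captioning rules shown are illustrative, not the paper's exact procedure.

```python
import random
from PIL import Image

def concept_fusion(subjects, backgrounds, canvas=(512, 512)) -> tuple[Image.Image, str]:
    """Compose segmented subjects onto a random background.

    subjects:    list of (RGBA cut-out, concept token) pairs.
    backgrounds: list of background images.
    Returns a composite training image and a simple caption for it.
    """
    bg = random.choice(backgrounds).resize(canvas).convert("RGB")
    tokens = []
    for cutout, token in subjects:
        scale = random.uniform(0.3, 0.6)                 # vary subject size
        w, h = int(canvas[0] * scale), int(canvas[1] * scale)
        piece = cutout.resize((w, h))
        x = random.randint(0, canvas[0] - w)             # random placement
        y = random.randint(0, canvas[1] - h)
        bg.paste(piece, (x, y), piece)                   # alpha channel acts as paste mask
        tokens.append(token)
    return bg, "a photo of " + " and ".join(tokens)
```

Each call yields a new subject arrangement, which widens the narrow few-shot distribution that otherwise drives overfitting.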
Problem

Research questions and friction points this paper is trying to address.

Overfitting in multi-concept text-to-image diffusion models
Attribute leakage among class-similar subjects
Limited diversity from small training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept Fusion augments data via subject-background recombination
Localized Refinement loss aligns attention maps precisely
FaR fine-tunes selected modules simultaneously to balance new and old concepts (see the sketch after this list)
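
As a rough illustration of the modular fine-tuning idea, the sketch below freezes a diffusers-style UNet except for its cross-attention key/value projections; the choice of "attn2.to_k"/"attn2.to_v" is an assumption for illustration, since the summary does not name the exact modules FaR updates.

```python
def select_trainable_params(unet):
    """Freeze everything except cross-attention K/V projections (assumed modules)."""
    trainable = []
    for name, param in unet.named_parameters():
        if "attn2.to_k" in name or "attn2.to_v" in name:
            param.requires_grad = True    # fine-tune cross-attention K/V only
            trainable.append(param)
        else:
            param.requires_grad = False   # keep prior knowledge in frozen weights
    return trainable
```

An optimizer would then be built over only the returned parameters, so new concepts are learned without overwriting the rest of the pretrained model.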
👥 Authors
Gia-Nghia Tran
University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Quang-Huy Che
University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Trong-Tai Dam Vu
Unknown affiliation
Bich-Nga Pham
University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Vinh-Tiep Nguyen
University of Information Technology, VNU-HCMC
Deep Learning, Computer Vision, Information Retrieval
Trung-Nghia Le
University of Science, VNU-HCM
Applied Deep Learning, Applied Computer Vision, Multimedia Security
Minh-Triet Tran
University of Science & John von Neumann Institute, VNU-HCM
Cryptography and Security, Multimedia and Interaction, Computer Vision and Machine Learning, Software Engineering