Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on methods rely on explicit garment image inputs, leaving them unable to respond to purely textual descriptions, a critical limitation for practical deployment. To address this, we propose the first Retrieval-Augmented Generation (RAG) framework for text-driven fashion image editing: given a user's textual description, our method retrieves semantically matching garment images and then, via textual inversion, projects them into the textual embedding space of the Stable Diffusion text encoder, so that their fine-grained visual attributes (texture, cut, and style) guide diffusion-based editing. Built upon Stable Diffusion, the approach combines cross-modal retrieval with diffusion-model fine-tuning. Quantitative and qualitative evaluations on the Dress Code benchmark show substantial improvements over state-of-the-art methods, with high-fidelity reconstruction of garment details.
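The retrieval stage is a standard cross-modal nearest-neighbor search: garment images and the user's text are embedded by a dual encoder and compared by cosine similarity. Below is a minimal sketch, assuming an off-the-shelf CLIP checkpoint and a small in-memory gallery; both are illustrative, not the authors' exact setup.

```python
# Minimal cross-modal retrieval sketch. The CLIP checkpoint and the
# in-memory gallery are illustrative assumptions, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_gallery(image_paths):
    """L2-normalized CLIP features for a list of garment image files."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve(query, gallery_feats, image_paths, k=3):
    """Top-k garments whose CLIP image features match the text query."""
    inputs = processor(text=[query], padding=True, return_tensors="pt")
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ gallery_feats.T).squeeze(0)        # cosine similarity
    return [image_paths[i] for i in scores.topk(k).indices.tolist()]
```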

📝 Abstract
In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become key areas of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on a specific garment image as input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.
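The core mechanism in the abstract, projecting retrieved garment images into the token-embedding space of the Stable Diffusion text encoder, can be pictured as a small learned adapter that emits a few pseudo-word embeddings per garment. The MLP shape and token count below are assumptions for illustration; the paper's exact mapper may differ.

```python
# Sketch of a textual-inversion adapter: garment image features become
# pseudo-token embeddings in the text encoder's space. Architecture and
# token count are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class GarmentInversionAdapter(nn.Module):
    """Maps one garment's image feature to n pseudo-token embeddings."""
    def __init__(self, image_dim=512, token_dim=768, n_tokens=4):
        super().__init__()
        self.n_tokens, self.token_dim = n_tokens, token_dim
        self.proj = nn.Sequential(
            nn.Linear(image_dim, token_dim * n_tokens),
            nn.GELU(),
            nn.Linear(token_dim * n_tokens, token_dim * n_tokens),
        )

    def forward(self, image_feats):                  # (B, image_dim)
        tokens = self.proj(image_feats)              # (B, n * token_dim)
        return tokens.view(-1, self.n_tokens, self.token_dim)

# Pseudo-tokens from each retrieved garment are concatenated with the
# ordinary prompt embeddings before the UNet's cross-attention, so the
# generator can attend to retrieved visual details (texture, cut, style).
```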
Problem

Research questions and friction points this paper is trying to address.

How to customize fashion items when the user provides only a textual description
How to retrieve garments that match the user's textual specifications
How to generate personalized images that preserve attributes of the retrieved items
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation applied to fashion image editing
Textual inversion projects retrieved garments into the diffusion model's text-embedding space (see the sketch after this list)
Multimodal inputs enable personalized fashion image generation
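For a rough end-to-end picture, the garment pseudo-tokens can be spliced into the prompt embeddings of a stock Stable Diffusion inpainting pipeline. Fashion-RAG fine-tunes its own model, so the checkpoint name, the random stand-in tokens, and the file paths below are all placeholders.

```python
# Hedged end-to-end sketch: splice garment pseudo-tokens into the prompt
# embeddings of a stock inpainting pipeline. Checkpoint, stand-in tokens,
# and file paths are placeholders; the paper fine-tunes its own model.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

prompt = "a person wearing a floral summer dress"
ids = pipe.tokenizer(prompt, padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids.to("cuda")
prompt_embeds = pipe.text_encoder(ids)[0]            # (1, 77, 768)

# Stand-in for adapter output on retrieved garments (see adapter sketch).
garment_tokens = torch.randn(1, 4, 768, device="cuda", dtype=torch.float16)
n = garment_tokens.shape[1]
# Replace the last n padding positions so the sequence stays 77 tokens long.
prompt_embeds = torch.cat([prompt_embeds[:, :-n], garment_tokens], dim=1)

person = Image.open("person.jpg").convert("RGB")     # placeholder inputs
mask = Image.open("garment_mask.png").convert("L")   # region to repaint
image = pipe(prompt_embeds=prompt_embeds, image=person,
             mask_image=mask).images[0]
image.save("fashion_rag_sketch.png")
```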