🤖 AI Summary
Existing virtual try-on methods struggle with realistic full-outfit scenarios involving multiple garments, fine-grained categories, layered combinations, and diverse styles. To address this gap, this work introduces Garments2Look, the first large-scale multimodal dataset designed specifically for outfit-level virtual try-on, covering 40 major categories and over 300 fine-grained subcategories with 80,000 outfit triplets. Each triplet includes 3–12 reference garment images, a corresponding in-the-wild model photograph, and detailed textual annotations. High-fidelity data are generated through a hybrid pipeline combining heuristic styling rules, image synthesis, automated filtering, and manual validation. Benchmarks with state-of-the-art models reveal persistent problems with garment layering, style consistency, spatial alignment, and artifacts, which highlights the complexity and research significance of full-outfit virtual try-on.
📝 Abstract
Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, all of which remain beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3–12 reference garment images (4.48 on average), an image of a model wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline that heuristically constructs outfit lists before generating try-on results, with the entire process subject to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt state-of-the-art (SOTA) VTON methods and general-purpose image editing models to establish baselines. Results show that current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.