VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of accurately retrieving traditional garments, such as the Vietnamese áo dài, whose nuanced structural and symbolic details often elude standard AI models. To bridge this gap, the authors introduce VietFashion, a sketch-text composed image retrieval benchmark designed for cultural attire. The benchmark starts from 650 sketches and uses generative models to expand them into over 21,000 photorealistic images with aligned captions, and it adopts a multi-target retrieval setting to capture the inherent ambiguity in design intent. Its fine-grained textual descriptions are derived from fashion magazines. Using a standardized evaluation protocol, the study shows that current composed image retrieval methods struggle to model fine-grained cultural semantics and multi-modal composition, and it releases the dataset publicly as a resource for future research in culturally aware fashion retrieval.

📝 Abstract

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. The textual prompts describe detailed outfit attributes and are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.
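To make the multi-target setting concrete, the sketch below shows one way a ranked result list could be scored when a query has several valid targets. The abstract does not state which metrics the benchmark uses, so the two measures here (a per-query recall and a hit rate at a cutoff k) are assumptions chosen for illustration, as are the function names and the toy image identifiers.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(ranked: List[str], targets: Set[str], k: int) -> float:
    """Fraction of a query's valid targets found in the top-k results."""
    if not targets:
        raise ValueError("each query needs at least one valid target")
    hits = sum(1 for item in ranked[:k] if item in targets)
    return hits / len(targets)


def hit_at_k(ranked: List[str], targets: Set[str], k: int) -> float:
    """1.0 if any valid target appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def mean_over_queries(
    metric: Callable[[List[str], Set[str], int], float],
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over every query with ground truth."""
    scores = [metric(rankings[q], ground_truth[q], k) for q in ground_truth]
    return sum(scores) / len(scores)


# Hypothetical toy data: each query (sketch + text) maps to a ranked list
# of gallery image ids, and to a SET of valid targets (multi-target).
rankings = {
    "q1": ["img3", "img7", "img1", "img9"],
    "q2": ["img2", "img5", "img8", "img4"],
}
ground_truth = {
    "q1": {"img1", "img7"},  # two acceptable results for one query
    "q2": {"img4"},
}

print(mean_over_queries(recall_at_k, rankings, ground_truth, k=2))
print(mean_over_queries(hit_at_k, rankings, ground_truth, k=2))
```

The two measures diverge exactly when a query has more than one valid target: the hit rate is satisfied by any single match, while the recall variant rewards retrieving all acceptable outfits. Ranked lists are assumed to contain no duplicate ids.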

Problem

Research questions and friction points this paper is trying to address.

cultural garments

sketch-text composed retrieval

fine-grained fashion retrieval

multi-modal composition

visual retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

sketch-text composed retrieval

cultural fashion

multi-modal retrieval