🤖 AI Summary
This work addresses two key challenges in virtual try-off (VTOFF), the inverse of virtual try-on: (1) difficulty in disentangling garment features under occlusion and pose variation, and (2) poor category generalization, as existing methods are typically restricted to a single garment category. To this end, we propose TEMU-VTOFF, the first multi-category VTOFF framework. Methodologically, it employs a dual-DiT backbone with a multimodal attention mechanism that integrates image, text, and mask inputs; in addition, it introduces a text-guided generation module and a geometric alignment module to enable fine-grained modeling of garment structure and pose-invariant, lay-flat-style reconstruction. Evaluated on VITON-HD and Dress Code, TEMU-VTOFF achieves state-of-the-art performance, significantly improving the visual quality and structural fidelity of the generated garment images. The framework offers a practical route to garment product-image synthesis and data augmentation in fashion applications.
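The multimodal attention integrating image, text, and mask inputs is not detailed here, so the following is only a minimal sketch of the general idea: joint self-attention over concatenated image, text, and mask tokens, in the spirit of MM-DiT-style blocks. The module name, projection layers, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not from the paper): joint attention over image latent
# tokens, text-encoder tokens, and garment-mask tokens projected into a shared
# space. All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalJointAttention(nn.Module):
    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # Per-modality projections into the shared token space.
        self.text_proj = nn.Linear(text_dim, dim)
        self.mask_proj = nn.Linear(1, dim)  # flattened mask patches -> tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, img_tokens, text_tokens, mask_tokens):
        # img_tokens:  (B, N_img, dim)      latent image patches
        # text_tokens: (B, N_txt, text_dim) text-encoder embeddings
        # mask_tokens: (B, N_msk, 1)        flattened garment-mask patches
        txt = self.text_proj(text_tokens)
        msk = self.mask_proj(mask_tokens)
        # Concatenate all modalities and run a single joint self-attention pass,
        # so garment appearance, description, and region cues interact directly.
        tokens = torch.cat([img_tokens, txt, msk], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        # Keep only the image-token stream for the next transformer block.
        n_img = img_tokens.shape[1]
        return img_tokens + self.out(fused[:, :n_img])


if __name__ == "__main__":
    block = MultimodalJointAttention(dim=256, text_dim=768)
    out = block(torch.randn(2, 64, 256),
                torch.randn(2, 16, 768),
                torch.randn(2, 64, 1))
    print(out.shape)  # torch.Size([2, 64, 256])
```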
📝 Abstract
While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format -- typically a flat, lay-down-style representation of the garment -- making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.
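The dual DiT-based backbone suggests that features extracted by one transformer condition the generating transformer. The sketch below shows one common way such conditioning is realized, appending reference features as extra keys/values inside a generator attention block; this is an assumption for illustration only, and all names (GeneratorBlockWithInjection, ref_feats) are hypothetical rather than the paper's confirmed mechanism.

```python
# Illustrative sketch, assuming a reference-feature injection scheme often used
# in dual-backbone designs: hidden states from a "garment extractor" transformer
# are appended as extra keys/values in the generator's attention.
import torch
import torch.nn as nn


class GeneratorBlockWithInjection(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, ref_feats):
        # x:         (B, N, dim) generator tokens (noisy garment latents)
        # ref_feats: (B, M, dim) features from the extractor backbone
        h = self.norm(x)
        # Extend keys/values with the reference tokens so the generator can
        # attend to the extracted garment features while denoising.
        kv = torch.cat([h, ref_feats], dim=1)
        attn_out, _ = self.attn(h, kv, kv)
        x = x + attn_out
        return x + self.mlp(self.norm(x))


if __name__ == "__main__":
    blk = GeneratorBlockWithInjection(dim=256)
    y = blk(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
    print(y.shape)  # torch.Size([2, 64, 256])
```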