1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On

📅 2025-01-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing virtual try-on (VTON) methods struggle to balance clothing detail fidelity and computational efficiency on high-resolution images/videos: dual-network architectures achieve superior performance but incur prohibitive overhead, whereas single-network approaches suffer from degraded detail synthesis. Targeting e-commerce applications, this work proposes the first efficient single-network VTON framework. Its core innovations are modality-specific normalization (MS-Norm) and a unified cross-modal attention mechanism, enabling shared attention layers across text, image, and video inputs. This design revitalizes the single-network paradigm—retaining lightweight inference while substantially improving fine-grained detail generation. Experiments demonstrate that our method outperforms state-of-the-art dual-network approaches on both image- and video-based VTON benchmarks. It achieves 2.1× faster inference speed and reduces GPU memory consumption by 37%, significantly enhancing scalability for high-resolution, long-sequence processing.

Technology Category

Application Category

📝 Abstract
Virtual Try-On (VTON) has become a crucial tool in ecommerce, enabling the realistic simulation of garments on individuals while preserving their original appearance and pose. Early VTON methods relied on single generative networks, but challenges remain in preserving fine-grained garment details due to limitations in feature extraction and fusion. To address these issues, recent approaches have adopted a dual-network paradigm, incorporating a complementary"ReferenceNet"to enhance garment feature extraction and fusion. While effective, this dual-network approach introduces significant computational overhead, limiting its scalability for high-resolution and long-duration image/video VTON applications. In this paper, we challenge the dual-network paradigm by proposing a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs, enabling them to share the same attention layers in a VTON network. Extensive experimental results demonstrate the effectiveness of our approach, showing that it consistently achieves higher-quality, more detailed results for both image and video VTON tasks. Our results suggest that the single-network paradigm can rival the performance of dualnetwork approaches, offering a more efficient alternative for high-quality, scalable VTON applications.
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On (VTON)
Clothing Detail Handling
High-Definition Video Compatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

MNVTON
Single-Network Approach
High-Definition VTON
🔎 Similar Papers
No similar papers found.