🤖 AI Summary
Existing virtual try-on methods offer little fine-grained control over garment layout and tend to apply the same dressing pattern to every input. This work proposes MOFA-VTON, which lets users adjust how the target garment is worn by drawing simple curve sketches. A mask construction strategy converts the sketches into a dual-region mask that replaces the usual clothing-agnostic mask, and cross-attention-based layout adjustment blocks learn layout correspondences for the upper and lower body regions independently. Together these components remove the fixed-layout constraint of earlier pipelines. Experiments on the VITON-HD and DressCode datasets show that MOFA-VTON outperforms previous state-of-the-art methods and supports a wider range of wearing styles.
📝 Abstract
Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An ideal virtual try-on method should offer diverse and flexible dressing options that reflect the varied wearing styles seen in real life and can be tailored to individual preferences and fashion aspirations. However, current methods predominantly replace the original clothing directly with the target clothing, always following the same dressing pattern. This limited control over clothing adaptation can lead to fixed and monotonous try-on outputs. To explore More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which lets users adjust clothing adaptations in the try-on result through simple sketches. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. We further propose layout adjustment blocks that use the cross-attention mechanism to independently learn layout correspondences for the upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these components, our method enables flexible and fine-grained adaptations of the target clothing, overcoming the constraints of a fixed layout. Extensive experiments on the VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.
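The abstract does not spell out how the dual-region mask is built, so the following is only a minimal sketch of one plausible reading: a single user-drawn curve (for example a hemline) splits an editable region into an upper and a lower part. The function name `dual_region_mask`, the label encoding, and the use of linear interpolation between sketch points are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def dual_region_mask(region, curve_xy):
    """Split a binary editable region into upper/lower parts along a sketched curve.

    Hypothetical helper, not taken from the paper.

    region   : (H, W) bool array, True where clothing may be generated.
    curve_xy : (N, 2) sequence of (x, y) sketch points, e.g. a hemline.
    Returns an (H, W) uint8 label map: 0 = untouched, 1 = upper, 2 = lower.
    """
    region = np.asarray(region, dtype=bool)
    h, w = region.shape

    pts = np.asarray(curve_xy, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]  # np.interp expects increasing x

    # Boundary height for every image column; columns outside the sketched
    # x-range reuse the nearest endpoint's height.
    boundary = np.interp(np.arange(w), pts[:, 0], pts[:, 1])

    rows = np.arange(h)[:, None]  # (H, 1), broadcasts against (1, W)
    upper = region & (rows < boundary[None, :])
    lower = region & ~upper

    labels = np.zeros((h, w), dtype=np.uint8)
    labels[upper] = 1
    labels[lower] = 2
    return labels


# Toy example: a 6x4 editable block and a hemline sloping down to the right.
region = np.zeros((8, 6), dtype=bool)
region[1:7, 1:5] = True
curve = [(0, 3.0), (5, 5.0)]
labels = dual_region_mask(region, curve)
```

Moving the curve up or down changes how much of the editable area is assigned to the upper garment, which is the kind of layout control the abstract describes; a real system would presumably also handle curves that do not span the full width and would feed the two regions to the generator separately.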