PEMF-VTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

📅 2024-12-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video virtual try-on methods suffer from spatial-temporal inconsistency and visual distortion in complex dynamic scenes, primarily due to large, irregular inpainting masks; mask-free approaches, in turn, struggle to localize garment regions precisely on moving human bodies. To address these limitations, we propose a mask-free framework guided by sparse point alignments, featuring a novel point-enhanced guidance mechanism. At its core lies the Point-Enhanced Transformer (PET), which integrates two dedicated modules: Point-Enhanced Spatial Attention (PSA), which uses frame-to-garment point alignments for accurate spatial garment transfer, and Point-Enhanced Temporal Attention (PTA), which uses frame-to-frame point correspondences to preserve temporal coherence. Extensive experiments on in-the-wild video benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, yielding more natural, temporally coherent, and visually superior virtual try-on results.

๐Ÿ“ Abstract
Video Virtual Try-on aims to seamlessly transfer a reference garment onto a target person in a video while preserving both visual fidelity and temporal coherence. Existing methods typically rely on inpainting masks to define the try-on area, enabling accurate garment transfer for simple scenes (e.g., in-shop videos). However, these mask-based approaches struggle with complex real-world scenarios, as overly large and inconsistent masks often destroy spatial-temporal information, leading to distorted results. Mask-free methods alleviate this issue but face challenges in accurately determining the try-on area, especially for videos with dynamic body movements. To address these limitations, we propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video Virtual Try-On framework that leverages sparse point alignments to explicitly guide garment transfer. Our key innovation is the introduction of point-enhanced guidance, which provides flexible and reliable control over both spatial-level garment transfer and temporal-level video coherence. Specifically, we design a Point-Enhanced Transformer (PET) with two core components: Point-Enhanced Spatial Attention (PSA), which uses frame-cloth point alignments to precisely guide garment transfer, and Point-Enhanced Temporal Attention (PTA), which leverages frame-frame point correspondences to enhance temporal coherence and ensure smooth transitions across frames. Extensive experiments demonstrate that our PEMF-VTO outperforms state-of-the-art methods, generating more natural, coherent, and visually appealing try-on videos, particularly for challenging in-the-wild scenarios.
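The core mechanism described above is sparse point alignments steering attention: PSA biases frame-to-garment cross-attention toward aligned point pairs, and PTA does the analogous thing with frame-to-frame correspondences. The paper does not publish its exact formulation here, so the following is only a minimal sketch of the general idea, assuming the point alignments act as an additive bias on cross-attention logits; the function name, the bias strength, and the single-head numpy setup are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_biased_cross_attention(frame_feats, cloth_feats, point_pairs,
                                 bias_strength=4.0):
    """Single-head cross-attention from frame tokens to garment tokens,
    with attention logits boosted at sparse point correspondences.

    frame_feats: (N, d) frame token features (queries)
    cloth_feats: (M, d) garment token features (keys/values)
    point_pairs: list of (frame_idx, cloth_idx) sparse alignments
                 (hypothetical input format; a tracker/matcher would supply it)
    """
    d = frame_feats.shape[-1]
    # Standard scaled dot-product logits.
    logits = frame_feats @ cloth_feats.T / np.sqrt(d)      # (N, M)
    # Additive bias: push each aligned frame token toward its garment point.
    bias = np.zeros_like(logits)
    for fi, ci in point_pairs:
        bias[fi, ci] = bias_strength
    attn = softmax(logits + bias, axis=-1)
    return attn @ cloth_feats                              # (N, d)
```

With `point_pairs` from frame-frame tracks instead of frame-cloth matches, the same biasing pattern sketches PTA: queries from frame *t* attend to frame *t−1* tokens with boosted weight at tracked points, which is what encourages smooth transitions.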
Problem

Research questions and friction points this paper is trying to address.

How to localize the try-on area accurately without inpainting masks.
Mask-based methods distort garment transfer in dynamic, real-world videos.
Maintaining temporal coherence and visual fidelity across video frames.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Point-Enhanced Mask-Free Video Virtual Try-On (PEMF-VTO) framework
Point-Enhanced Transformer (PET) with spatial and temporal attention modules
Frame-cloth point alignments for garment transfer; frame-frame point correspondences for temporal coherence
Tianyu Chang
University of Science and Technology of China
Xiaohao Chen
Alibaba International Digital Commerce
Zhichao Wei
Alibaba International Digital Commerce
Xuanpu Zhang
Tianjin University
Qing-Guo Chen
Alibaba
Weihua Luo
Alibaba
Peipei Song
University of Science and Technology of China
Xun Yang
University of Science and Technology of China