🤖 AI Summary
Existing video virtual try-on methods suffer from spatial-temporal inconsistency and visual distortion in complex dynamic scenes, primarily because large, irregular inpainting masks destroy scene structure; mask-free approaches, in turn, struggle to precisely localize garment regions on moving human bodies. To address these limitations, we propose a mask-free framework guided by sparse point alignments, featuring a novel point-enhanced guidance mechanism. At its core lies the Point-Enhanced Transformer (PET), which integrates two dedicated attention modules: Point-Enhanced Spatial Attention (PSA), which uses frame-to-garment point alignments to guide accurate spatial garment transfer, and Point-Enhanced Temporal Attention (PTA), which exploits frame-to-frame point tracking to preserve temporal coherence. Extensive experiments on in-the-wild video benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, yielding more natural, temporally coherent, and visually superior virtual try-on results.
📝 Abstract
Video Virtual Try-on aims to seamlessly transfer a reference garment onto a target person in a video while preserving both visual fidelity and temporal coherence. Existing methods typically rely on inpainting masks to define the try-on area, enabling accurate garment transfer for simple scenes (e.g., in-shop videos). However, these mask-based approaches struggle with complex real-world scenarios, as overly large and inconsistent masks often destroy spatial-temporal information, leading to distorted results. Mask-free methods alleviate this issue but face challenges in accurately determining the try-on area, especially for videos with dynamic body movements. To address these limitations, we propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video Virtual Try-On framework that leverages sparse point alignments to explicitly guide garment transfer. Our key innovation is the introduction of point-enhanced guidance, which provides flexible and reliable control over both spatial-level garment transfer and temporal-level video coherence. Specifically, we design a Point-Enhanced Transformer (PET) with two core components: Point-Enhanced Spatial Attention (PSA), which uses frame-cloth point alignments to precisely guide garment transfer, and Point-Enhanced Temporal Attention (PTA), which leverages frame-frame point correspondences to enhance temporal coherence and ensure smooth transitions across frames. Extensive experiments demonstrate that our PEMF-VTO outperforms state-of-the-art methods, generating more natural, coherent, and visually appealing try-on videos, particularly for challenging in-the-wild scenarios.
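The abstract does not give implementation details of the point-enhanced attention modules, but the core idea it describes — using sparse point correspondences to steer attention toward aligned locations — can be illustrated with a toy sketch. The following is a minimal, hypothetical single-head attention in pure Python in which each (query, key) pair linked by a sparse point correspondence receives an additive logit bias; the function name, `bias` strength, and overall structure are illustrative assumptions, not the paper's actual PSA/PTA implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]


def point_biased_attention(queries, keys, values, point_pairs, bias=4.0):
    """Toy single-head scaled dot-product attention with point guidance.

    point_pairs is a set of (query_idx, key_idx) sparse correspondences
    (e.g. frame-to-garment or frame-to-frame point alignments). Matched
    pairs get an additive logit bias, steering attention toward aligned
    locations. This is an illustrative assumption about how point
    guidance might enter an attention module, not the paper's method.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi, q in enumerate(queries):
        logits = []
        for ki, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) * scale
            if (qi, ki) in point_pairs:
                score += bias  # boost attention at aligned point pairs
            logits.append(score)
        w = softmax(logits)
        # weighted sum of value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

For example, with a query that by content would attend to the first key, adding a correspondence `(0, 1)` pulls nearly all attention mass onto the second key, mimicking how a sparse point alignment could override ambiguous appearance cues.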